5420 Anomaly Detection, Fall 2020

Assignment 7: Unsupervised Machine Learning II

Submitted by: Harsh Dhanuka, hd2457

Objectives

It is important in any data science project to define the objective as specifically as possible. Below, I refine it from general to specific; this will direct the analysis.

  • Identify hospitals that may abuse resources or commit fraud.
  • Identify hospitals that may abuse resources or commit fraud compared to their peers.
  • Identify hospitals that may abuse resources or commit fraud compared to the average (mean, median, etc.) of their peers.
  • Identify hospitals that may abuse resources or commit fraud compared to the average (mean, median, etc.) of their peers in the same DRG and State.
  • Identify hospitals that may abuse resources or commit fraud through different unsupervised anomaly detection techniques - Autoencoders and iForest.
  • Identify hospitals that may abuse resources or commit fraud through different unsupervised anomaly detection techniques - Autoencoders and iForest - by tuning the hyperparameters and using the 'Average' aggregation method to cross-validate and build model stability.
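
The 'Average' aggregation mentioned in the last objective can be sketched as follows: standardize each detector's raw anomaly scores so they are on a comparable scale, then average them per observation. The scores below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical anomaly scores for five hospitals from two detectors
scores_ae = np.array([0.2, 0.9, 0.1, 0.4, 3.0])   # e.g. autoencoder reconstruction error
scores_if = np.array([10., 80., 5., 20., 95.])    # e.g. iForest score

def zscore(s):
    """Standardize scores so detectors on different scales are comparable."""
    return (s - s.mean()) / s.std()

# 'Average' combination: mean of the standardized scores per observation
combined = np.vstack([zscore(scores_ae), zscore(scores_if)]).mean(axis=0)
print(combined.argmax())  # -> 4, the hospital with the strongest combined signal
```

PyOD ships the same idea as `average()` in `pyod.models.combination`, used together with `pyod.utils.utility.standardizer`, which is what Section 5 relies on.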

Please click Section 5 to go directly to the Autoencoders and iForest analysis.

Table of Contents

Section 1: Initial Steps

Section 2: Data Cleaning and Preparation

Section 3: EDA of all variables

I have commented out this EDA section, as the detailed EDA and all the graphs were already shown in the previous two submissions.

Section 4: Feature Engineering

Section 5: Autoencoders, and iForest

Understanding the two methods:

Autoencoders

An autoencoder is a special type of neural network that copies the input values to the output values. It does not require a target variable Y, so it is categorized as unsupervised learning.

Indeed, we are not so much interested in the output layer; we are interested in the hidden core layer. If the number of neurons in the hidden layers is less than that of the input layer, the hidden layers will extract the essential information of the input values. This condition forces the hidden layers to learn the dominant patterns of the data and ignore the "noise". So in an autoencoder model, the hidden layers must have fewer dimensions than the input or output layers. If the number of neurons in the hidden layers is greater than that of the input layer, the neural network is given too much capacity to learn the data; in an extreme case, it could simply copy the input to the output values, noise included, without extracting any essential information. Observations that the bottleneck cannot reconstruct well, i.e. those with a high reconstruction error, are flagged as anomalies.
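
A minimal sketch of the bottleneck idea on synthetic data, using scikit-learn's MLPRegressor as a stand-in autoencoder (Section 5 uses PyOD's AutoEncoder on the real features):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 8))               # mostly "normal" observations
X[:10] = rng.normal(scale=6, size=(10, 8))  # 10 injected anomalies
Xs = StandardScaler().fit_transform(X)

# 8 inputs -> 4 hidden neurons -> 8 outputs: the bottleneck forces compression
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=42)
ae.fit(Xs, Xs)                              # target = input: learn to reconstruct

# Reconstruction error per observation serves as the anomaly score
errors = ((Xs - ae.predict(Xs)) ** 2).mean(axis=1)
```

Points the bottleneck cannot reconstruct well (the injected anomalies) should receive the largest errors, which is exactly the scoring logic the real model uses.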

iForest

Isolation Forest, or iForest, is an anomaly detection algorithm created by Fei Tony Liu et al. They argue that most existing approaches to anomaly detection find the norm first, then identify observations that do not conform to it. They propose the Isolation Forest as an alternative approach — explicitly isolating anomalies instead of profiling normal data points. Anomalies are isolated close to the root of a tree, whereas normal points are isolated at the deeper end. They call each tree an Isolation Tree, or iTree. This isolation characteristic of the trees forms the basis for detecting anomalies.

When we model the data with an iTree, outliers usually have short path lengths in the tree. On the other hand, the bulk of the normal observations requires many tree branches to isolate, so the path depth becomes a good proxy for the anomaly score. It is important to note that the iTree algorithm differs from the decision tree algorithm because an iTree does not use a target variable to train the tree; it is an unsupervised learning method.
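
PyOD's IForest (used in Section 5) wraps scikit-learn's IsolationForest; a minimal sketch of the short-path-length idea on synthetic 2-D data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),   # dense normal cluster
               [[8., 8.], [-9., 7.]]])            # two obvious outliers

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = -iso.score_samples(X)     # higher = shorter average path = more anomalous
```

The two injected points are split off near the root of most iTrees, so they receive the highest scores and the -1 label.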

1. Initial Steps

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
import missingno as msno

from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from pprint import pprint
import zipcodes

import plotly
import plotly.express as px

from pyod.models.iforest import IForest
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.utils.data import generate_data, get_outliers_inliers
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

1.1. Load Data

In [2]:
# Read the data

df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 6 PyOD/inpatientCharges.csv')
df.head()
Out[2]:
DRG Definition Provider Id Provider Name Provider Street Address Provider City Provider State Provider Zip Code Hospital Referral Region Description Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 $5777.24 $4763.73
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 $15131.85 $5787.57 $4976.71
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 $37560.37 $5434.95 $4453.79
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 $13998.28 $5417.56 $4129.16
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 $31633.27 $5658.33 $4851.44

1.2. Basic Summary Check

In [3]:
# Rows and  Columns

print(" ")
print("Number of rows and columns in the dataset:")
df.shape
 
Number of rows and columns in the dataset:
Out[3]:
(163065, 12)
In [4]:
print(" ")
print("Basic statistics of the numerical columns are as follows:")

# Check basic statistics
df.describe()
 
Basic statistics of the numerical columns are as follows:
Out[4]:
Provider Id Provider Zip Code Total Discharges
count 163065.000000 163065.000000 163065.000000
mean 255569.865428 47938.121908 42.776304
std 151563.671767 27854.323080 51.104042
min 10001.000000 1040.000000 11.000000
25% 110092.000000 27261.000000 17.000000
50% 250007.000000 44309.000000 27.000000
75% 380075.000000 72901.000000 49.000000
max 670077.000000 99835.000000 3383.000000
In [5]:
# Check the column dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
DRG Definition                          163065 non-null object
Provider Id                             163065 non-null int64
Provider Name                           163065 non-null object
Provider Street Address                 163065 non-null object
Provider City                           163065 non-null object
Provider State                          163065 non-null object
Provider Zip Code                       163065 non-null int64
Hospital Referral Region Description    163065 non-null object
 Total Discharges                       163065 non-null int64
 Average Covered Charges                163065 non-null object
 Average Total Payments                 163065 non-null object
Average Medicare Payments               163065 non-null object
dtypes: int64(3), object(9)
memory usage: 14.9+ MB

Considerations from eyeballing the data:

  1. Check for categorical/object fields from the data variable descriptions, and convert the relevant numeric fields to categorical/object types: Provider Id.

  2. The Zip Code column is represented as an integer. Convert it to a proper zipcode format.

  3. The variable Hospital Referral Region Description comprises the State and a city, which appears to be the nearest metro city.

  4. Average Covered Charges is not significant for our analysis; it is more relevant for other purposes such as claims fraud, insurance premiums, etc.

  5. The payments and charges columns need to be converted to proper numeric formats.

In [6]:
# Basic Sort of Provider ID and DRG Definition

df = df.sort_values(['Provider Id', 'DRG Definition'])
df.head(2)
Out[6]:
DRG Definition Provider Id Provider Name Provider Street Address Provider City Provider State Provider Zip Code Hospital Referral Region Description Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 $5777.24 $4763.73
1079 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 $4894.76 $3865.50

2. Data Cleaning and Preparation

2.1. Rename all column names

The given column names contain embedded and trailing spaces. I will rename all the columns to follow a consistent naming convention.

In [7]:
df.columns = ['DRG_Definition', 'Provider_Id', 'Provider_Name',
       'Provider_Street_Address', 'Provider_City', 'Provider_State',
       'Provider_Zip_Code', 'Hospital_Referral_Region_Description',
       'Total_Discharges', 'Average_Covered_Charges',
       'Average_Total_Payments', 'Average_Medicare_Payments']
df.columns
Out[7]:
Index(['DRG_Definition', 'Provider_Id', 'Provider_Name',
       'Provider_Street_Address', 'Provider_City', 'Provider_State',
       'Provider_Zip_Code', 'Hospital_Referral_Region_Description',
       'Total_Discharges', 'Average_Covered_Charges', 'Average_Total_Payments',
       'Average_Medicare_Payments'],
      dtype='object')

2.2. Convert amount variables to appropriate numeric format, strip off the dollar symbol

In [8]:
df["Average_Total_Payments"] = df["Average_Total_Payments"].str[1:].astype(float)
df["Average_Medicare_Payments"] = df["Average_Medicare_Payments"].str[1:].astype(float)
df["Average_Covered_Charges"] = df["Average_Covered_Charges"].str[1:].astype(float)
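
The str[1:] slice works here because every value carries exactly one leading $ and no thousands separators. A more defensive version (shown on illustrative values) strips whitespace, the dollar sign, and any commas:

```python
import pandas as pd

s = pd.Series(["$32963.07", "$5,777.24", " $4763.73 "])
cleaned = (s.str.strip()
            .str.replace(r"[$,]", "", regex=True)
            .astype(float))
print(cleaned.tolist())  # [32963.07, 5777.24, 4763.73]
```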

2.3. Convert Provider_Id to appropriate object format

In [9]:
df["Provider_Id"] = df["Provider_Id"].astype(object)
In [10]:
df.head(2)
Out[10]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73
1079 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50

2.4. Convert Provided_Zip_Code from integer to appropriate Zipcode format

In [11]:
# Pad zipcodes to the standard 5-character string format

df['Provider_Zip_Code'] = df['Provider_Zip_Code'].astype(str).str.zfill(5)
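
Storing zipcodes as integers drops leading zeros (the minimum zip in the summary above, 1040, is really Massachusetts zipcode 01040); zfill restores the 5-character form:

```python
import pandas as pd

zips = pd.Series([1040, 36301])        # integers lose the leading zero
zips5 = zips.astype(str).str.zfill(5)  # back to 5-character strings
print(zips5.tolist())  # ['01040', '36301']
```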

2.5. Check NA's

In [12]:
df.isnull().sum().sum()
Out[12]:
0

There are no NA's, which is good for our analysis.

2.6. Add a new "states" dataset to match 'Regions' with the existing Provider_State variable.

Region will be a very useful feature when performing the Exploratory Data Analysis.

In [13]:
states = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 6 PyOD/states.csv')
states.head(5)
Out[13]:
State State Code Region Division
0 Alaska AK West Pacific
1 Alabama AL South East South Central
2 Arkansas AR South West South Central
3 Arizona AZ West Mountain
4 California CA West Pacific
In [14]:
# Left join the new dataset

df = pd.merge(left = df, right = states, left_on = 'Provider_State', right_on = 'State Code', how = 'left')
df.head(2)
Out[14]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments State State Code Region Division
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 Alabama AL South East South Central
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50 Alabama AL South East South Central
In [15]:
# Remove duplicate 'state' column

df = df.drop(columns = ['State', 'State Code'])
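
A left join silently leaves NaNs wherever a state code fails to match; pandas' indicator flag makes unmatched rows easy to audit. A sketch on toy frames (the names below are illustrative, not the real data):

```python
import pandas as pd

df_toy = pd.DataFrame({"Provider_State": ["AL", "AK", "ZZ"]})
states_toy = pd.DataFrame({"State Code": ["AL", "AK"],
                           "Region": ["South", "West"]})

merged = df_toy.merge(states_toy, left_on="Provider_State",
                      right_on="State Code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Provider_State"]
print(unmatched.tolist())  # ['ZZ']
```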

2.7. Add 'Median Income' by zipcode dataset to match by zipcode in original dataset

Dataset source: https://www.psc.isr.umich.edu/dis/census/Features/tract2zip/

This dataset contains the zipcode-wise mean and median income data for 2006 to 2010.

In [16]:
income_df = pd.read_excel('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 6 PyOD/MedianZIP-3.xlsx')
income_df['Zip'] = income_df['Zip'].astype(str).str.zfill(5)
income_df.head(3)
Out[16]:
Zip Median Mean Pop
0 01001 56662.5735 66687.8 16445
1 01002 49853.4177 75062.6 28069
2 01003 28462.0000 35121 8491
In [17]:
income_df.isnull().sum()
Out[17]:
Zip       0
Median    0
Mean      0
Pop       0
dtype: int64
In [18]:
df = pd.merge(left = df, right = income_df, left_on = 'Provider_Zip_Code', right_on = 'Zip', how = 'left')
df = df.drop(columns = ['Zip','Mean'])

df.rename(columns={'Median':'Zip_Median_Income',
                   'Pop':'Zip_Population'}, inplace=True)

df.head(2)
Out[18]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0

.

-------------------------------------- SECTION BREAK ------------------------------------

.

3. EDA

3.1. Showing the Distribution of X

3.1.1. DRG_Definition Distribution

Explore the total number of DRG Definitions, and the count of how many times they appear.

In [19]:
df_count = df['DRG_Definition'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['DRG_Definition','Count']
df_count.head()
Out[19]:
DRG_Definition Count
0 194 - SIMPLE PNEUMONIA & PLEURISY W CC 3023
1 690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC 2989
2 292 - HEART FAILURE & SHOCK W CC 2953
3 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DIS... 2950
4 641 - MISC DISORDERS OF NUTRITION,METABOLISM,F... 2899
In [20]:
fig = px.bar(df_count, x = 'DRG_Definition', y = 'Count', color = 'DRG_Definition',
             width=1450, height=500,
            title = "Distribution of DRG Definitions")
fig.show()

Observation:

The DRG Definitions have a seemingly good distribution. The counts of DRG Definitions range from around 600 to 3,000, and all DRG Definition counts lie within this range.

Here, no DRG Definition count looks like an outlier; all behave normally.

Explore the total number of Provider Names and the count of how many times each one appears.

In [21]:
df_count = df['Provider_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_Name','Count']
df_count.head()
Out[21]:
Provider_Name Count
0 GOOD SAMARITAN HOSPITAL 633
1 ST JOSEPH MEDICAL CENTER 427
2 MERCY MEDICAL CENTER 357
3 MERCY HOSPITAL 347
4 ST JOSEPH HOSPITAL 343
In [22]:
# Show only those Provider_Names whose total count is above 100

df_count1 = df_count.loc[df_count['Count'] > 100]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
             width=1200, height=500,
             color = 'Provider_Name',
            title = "Distribution of Provider Names, only showing for Count > 100")
fig.show()
In [23]:
# Show only those Provider_Names whose total count is below 3

df_count1 = df_count.loc[df_count['Count'] < 3]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
             width=1200, height=500,
            color = 'Provider_Name',
            title = "Distribution of Provider Names, only showing for Count < 3")
# fig.show()

Observation:

From the above two count charts, it is clear that some Providers are extremely popular, with around 600 entries. They seem to be the ones who provide services under multiple DRG Definitions.

Meanwhile, some Providers are very unpopular, with only 1 entry. This can depend on the DRG Definition: some hospitals may be single-specialty hospitals, so all patients in the area go there for that treatment.

Explore the total number of Cities and the count of how many times each one appears.

In [24]:
df_count = df['Provider_City'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_City','Count']
df_count.head()
Out[24]:
Provider_City Count
0 CHICAGO 1505
1 BALTIMORE 1059
2 HOUSTON 950
3 PHILADELPHIA 898
4 BROOKLYN 877
In [25]:
# Show only those Provider_Cities whose total count is above 500

df_count1 = df_count.loc[df_count['Count'] > 500]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
            color = 'Provider_City',
            title = "Distribution of Provider Cities, only showing for Count >500")
# fig.show()
In [26]:
# Show only those Provider_Cities whose total count is below 5

df_count1 = df_count.loc[df_count['Count'] < 5]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
            color = 'Provider_City',
            title = "Distribution of Provider Cities, only showing for Count < 5")
# fig.show()

Observation:

Chicago is the most popular city, with around 1,500 entries. There are also many other cities that have only 1 entry.

Explore the total number of States and the count of how many times each one appears.

In [27]:
df_count = df['Provider_State'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_State','Count']
df_count.head()
Out[27]:
Provider_State Count
0 CA 13064
1 TX 11864
2 FL 11155
3 NY 9178
4 IL 7909
In [28]:
fig = px.bar(df_count, x='Provider_State', y='Count',
             width=1000, height=500,
             color = 'Provider_State',
            title = "Distribution of Provider State")
# fig.show()

Observation:

The states seem to have a good distribution. There appear to be no outliers or states requiring special attention.

3.1.5. Average_Total_Payments Distribution

In [29]:
fig = px.histogram(df, x="Average_Total_Payments",
            width=1000, height=500,
            title = "Distribution of Average Total Payments")
# fig.show()
In [30]:
fig = px.box(df, x="Average_Total_Payments",width=1000, height=500,
            title = "Distribution of Average Total Payments")
fig.show()
In [31]:
fig = px.violin(df, y="Average_Total_Payments", box=True, 
                points='all',width=800, height=500,
               title = "Distribution of Average Total Payments")
# fig.show()

Observation:

Most of the Average Total Payments are less than USD 11,000, so any average payment above that might warrant further investigation.

There are some extreme high values, more than USD 150,000 which may need investigation.

  • Median: 7214.1

3.1.6. Average_Medicare_Payments Distribution

In [32]:
fig = px.histogram(df, x="Average_Medicare_Payments",
            width=1000, height=500,
                  title = "Distribution of Average Medicare Payments")
# fig.show()
In [33]:
fig = px.box(df, x="Average_Medicare_Payments",width=1000, height=500,
            title = "Distribution of Average Medicare Payments")
# fig.show()

Observation:

Most of the Average Medicare Payments are less than USD 10,000, so any average payment above that might warrant further investigation.

There are some extreme high values, more than USD 150,000 which may need investigation.

  • Median: 6158.46

3.1.7. Total_Discharges Distribution

In [34]:
fig = px.histogram(df, x="Total_Discharges",
            width=800, height=500,
                  title = "Distribution of Total Discharges")
# fig.show()
In [35]:
fig = px.box(df, x="Total_Discharges",width=1000, height=500,
            title = "Distribution of Total Discharges")
# fig.show()

Observation:

Most of the Total Discharges are less than 49.

There are some extreme high values, such as 3383, which may need investigation.

  • Median: 27

3.2. Showing the Distribution of Y by another Categorical Variable X

3.2.1. Average_Total_Payments by DRG_Definition

In [36]:
fig = px.box(df, x="DRG_Definition", y="Average_Total_Payments",width=1400, height=500,
             color = "DRG_Definition",
            title = "The Distribution of Average Total Payments by DRG Definition")
fig.show()

Observation:

Some DRG's have very high Average Total Payments; these may be critical operations, which carry high costs.

3.2.2. Average_Total_Payments by State

In [37]:
fig = px.box(df, x="Provider_State", y="Average_Total_Payments",width=1000, height=500,
             color = "Provider_State",
            title = "The Distribution of Average Total Payments by Provider State")
# fig.show()

Observation:

The Average Total Payments are more or less similar across states, but some states such as NY and CA are very expensive overall.

3.2.3. Average_Total_Payments by Region

In [38]:
fig = px.box(df, x="Region", y="Average_Total_Payments",width=1000, height=500,
             color = "Region",
            title = "The Distribution of Average Total Payments by Region")
# fig.show()
In [39]:
# px.violin(df,x='Average_Total_Payments', y = "Region", color='Region',
#          title = "The Distribution of Average Total Payments by Region",
#          orientation='h').update_traces(side='positive',width=2)

Observation:

The West region seems to be generally high in terms of Average Total Payments. This is consistent with what we saw earlier: the state CA was extremely high as well.

It is followed by Northeast, which includes the state NY.

3.2.4. Total_Discharges by DRG_Definition

In [40]:
fig = px.box(df, x="DRG_Definition", y="Total_Discharges",width=1400, height=500,
             color = "DRG_Definition",
            title = "The Distribution of Total Discharges by DRG Definition")
# fig.show()

Observation:

The discharge rate for some DRG's is very high, while most others have a balanced discharge rate.

3.2.5. Total_Discharges by Region

In [41]:
fig = px.box(df, x="Region", y="Total_Discharges",width=1000, height=500,
             color = "Region",
            title = "The Distribution of Total Discharges by Region")
# fig.show()
In [42]:
# px.violin(df,x='Total_Discharges', y = "Region", color='Region',
#          title = "The Distribution of Total Discharges by Region",
#          orientation='h').update_traces(side='positive',width=2)

Observation:

Most regions have a similar total discharge pattern. However, the Northeast region has an outlier.

3.3. Showing interaction of two or three variables

3.3.1. See interaction between Average_Total_Payments and Average_Medicare_Payments

In [43]:
fig = px.scatter(df, x="Average_Total_Payments", y="Average_Medicare_Payments",
                 size = "Average_Total_Payments", color = 'Average_Total_Payments',
                size_max=60,width=800, height=600)
# fig.show()

Observation:

As the average total payments increase, the average medicare payments also increase, which shows that there is a very high collinearity between these two variables.

3.3.2. Understand features such as mean, median, min, max, etc of Average_Total_Payments

I will group the entire data by DRG_Definition and then calculate the statistics for each group overall.

In [44]:
agg_columns = ['mean', 'median', 'var', 'std', 'count', 'min', 'max']
groupby_drg = df[['DRG_Definition', 'Average_Total_Payments']].groupby(by='DRG_Definition').agg(agg_columns)

# Flatten the MultiIndex columns, e.g. ('Average_Total_Payments', 'mean') -> 'Average_Total_Payments-mean'
groupby_drg.columns = ['-'.join(col) for col in groupby_drg.columns]
In [45]:
groupby_drg.reset_index(inplace=True)
groupby_drg['Average_Total_Payments-range'] = groupby_drg['Average_Total_Payments-max'] - groupby_drg['Average_Total_Payments-min']

groupby_drg.head(2)
Out[45]:
DRG_Definition Average_Total_Payments-mean Average_Total_Payments-median Average_Total_Payments-var Average_Total_Payments-std Average_Total_Payments-count Average_Total_Payments-min Average_Total_Payments-max Average_Total_Payments-range
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 6960.534004 6582.89 2.184111e+06 1477.873952 1079 4968.00 18420.56 13452.56
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 6706.276445 6093.75 4.137017e+06 2033.965862 1201 4194.09 25519.43 21325.34
In [46]:
def plt_setup(_plt):
    _plt.tick_params(
        axis='x',           # changes apply to the x-axis
        which='both',       # both major and minor ticks are affected
        bottom=False,       # hide ticks along the bottom edge
        top=False,          # hide ticks along the top edge
        labelbottom=False)  # hide the x-axis tick labels

See the 'mean' of the respective DRG_Definition groups

In [47]:
plt.figure(figsize=(20,8))
# sns.barplot(x='DRG_Definition', y='Average_Total_Payments-mean', 
#            data=groupby_drg.sort_values('Average_Total_Payments-mean'))
plt_setup(plt)
plt.title('Mean Average Total Payments by DRG', fontsize=16)
plt.ylabel('Mean of Average Total Payments', fontsize=16)
Out[47]:
Text(0, 0.5, 'Mean of Average Total Payments')

Observation:

Some DRG groups have a very high mean, implying that they generally charge a very high amount for treatment in terms of Total Payments.

3.3.3. Sum of Average Total Payments by DRG_Definition

In [48]:
pyt_by_drg = df.groupby('DRG_Definition').sum().reset_index()
In [49]:
pyt_by_drg = pyt_by_drg.sort_values('Average_Total_Payments', ascending=False)
# pyt_by_drg.head()
In [50]:
# Extract only rows with amount > 40,000,000

pyt_by_drg = pyt_by_drg.loc[pyt_by_drg['Average_Total_Payments'] > 40000000]
In [51]:
# plt.figure(figsize=(20,4))
# fig = sns.barplot(x='DRG_Definition', y='Average_Total_Payments', 
#             data=pyt_by_drg)
# fig.set_xticklabels(fig.get_xticklabels(), rotation=15)

# plt.title('Mean Average Total Payments by DRG, for total > 40,000,000', fontsize=16)
# plt.ylabel('Mean of Average Total Payments', fontsize=16)

Observation:

The DRG 329 is the highest in terms of the total sum of Average Total Payments.

3.3.4. Unique ids, unique names, and unique cities for Providers

In [52]:
unique_ids = len(df.groupby('Provider_Id').count())
unique_providers = len(df.groupby('Provider_Name').count())
unique_cities = len(df.groupby('Provider_City').count())
unique_states = len(df.groupby('Provider_State').count())

print(" ")
print(f'There are {unique_ids} unique provider id values in the data, and {unique_providers} unique provider names in a total of {unique_cities} unique cities, and {unique_states} states.')
print(" ")
 
There are 3337 unique provider id values in the data, and 3201 unique provider names in a total of 1977 unique cities, and 51 states.
 

3.3.5. Check correlations between the three numerical/integer variables, region wise

In [53]:
fig = sns.pairplot(df[['Region', 'Total_Discharges', 'Average_Total_Payments','Average_Medicare_Payments']], 
             hue= 'Region')
fig
Out[53]:
<seaborn.axisgrid.PairGrid at 0x7f7ecb59b050>
In [54]:
corr = df[['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments']].corr()
f,ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, annot=True, cmap='Reds', linewidths=.4, fmt= '.1f',ax=ax)
# plt.show()
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ecb59b390>

Observation:

From the above graphs, some variables are highly correlated, such as Average Total Payments and Average Medicare Payments. Average Total Payments has a long-tail distribution, which could indicate potential fraud.

From the correlation matrix: Average Total Payments is strongly correlated with Average Medicare Payments.

We can conclude that those variables are indeed related; for modeling purposes, it makes more sense to include only one or two of the three variables.

3.3.6. State wise-distribution of the three numerical variables

In [55]:
plt.figure(figsize=(20,20))
g = sns.PairGrid(df,
                 x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'], 
                 y_vars = ['Provider_State'],
                 height=10, aspect=.25)

# Draw plot
g.map(sns.stripplot, size=10, orient="h",
      palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")

# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")

titles = ["Total Discharges", "Average Total Payments",
          "Average Medicare Payments"]

for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
    ax.set(title = title)

    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)

# plt.tight_layout()
# plt.show()
<Figure size 1440x1440 with 0 Axes>

3.3.7. Region wise-distribution of the three numerical variables

In [56]:
#plt.figure(figsize=(20,20))
#g = sns.PairGrid(df,
#                 x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'], 
#                 y_vars = ['Region'],
#                 height=10, aspect=.25)

# Draw plot
#g.map(sns.stripplot, size=10, orient="h",
#      palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")

# Use the same x axis limits on all columns and add better labels
#g.set(xlabel="", ylabel="")

#titles = ["Total Discharges", "Average Total Payments",
#          "Average Medicare Payments"]

#for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
#    ax.set(title = title)

    # Make the grid horizontal instead of vertical
#    ax.xaxis.grid(False)
#    ax.yaxis.grid(True)

# plt.tight_layout()
# plt.show()

.

-------------------------------------- SECTION BREAK --------------------------------------

.

4. Feature Engineering

Feature 1

By understanding the top DRG's per state, we establish a baseline for what people in each state normally get treated for. This information is useful in one of two ways:

  • Can be used to flag non-common and expensive procedures
  • Focus fraud detection on the common procedures.

The assumption is that fraudulent users would try to get treated for the same conditions as the general population.

In [57]:
drg_by_state = df.groupby(['Provider_State', 'DRG_Definition']).agg({'DRG_Definition': 'count'})
drg_by_state.head()
Out[57]:
DRG_Definition
Provider_State DRG_Definition
AK 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 1
057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC 1
064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC 2
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 6
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 4
In [58]:
drg_by_state.tail()
Out[58]:
DRG_Definition
Provider_State DRG_Definition
WY 872 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC 2
897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REHABILITATION THERAPY W/O MCC 1
917 - POISONING & TOXIC EFFECTS OF DRUGS W MCC 1
918 - POISONING & TOXIC EFFECTS OF DRUGS W/O MCC 2
948 - SIGNS & SYMPTOMS W/O MCC 3

Note:

I am not merging this with the original dataset as this is a new data table created to be used as a reference to check the most common DRG's by state

Feature 2

4.2. Number of Provider_Names per city

This information can be used in one of two ways:

  • Assuming the cities with the most providers have a higher probability of fraud, we can focus on the cities with the highest concentration of providers.
  • Assuming the cities with the lowest provider-per-population density are higher risk: fraudsters there have fewer options to cheat and are forced to choose from a select few.
In [59]:
providers_per_city = df.groupby(['Provider_City']).agg({'Provider_Name':'count'})
providers_per_city.head()
Out[59]:
Provider_Name
Provider_City
ABBEVILLE 18
ABERDEEN 107
ABILENE 152
ABINGDON 63
ABINGTON 99
In [60]:
fig = plt.figure(figsize=(10,7))
providers_per_city.sort_values(by = 'Provider_Name', ascending = False).head()
sns.distplot(providers_per_city)
plt.tight_layout()

Note:

I am not merging this with the original dataset; this new table serves as a reference for the cities with the most providers.
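The second hypothesis above (provider-per-population density) can be sketched as follows. The provider counts come from the head() output above, but the city population figures are hypothetical placeholders, since this table carries no population data:

```python
import pandas as pd

# Toy provider counts from the head() above; City_Population values
# are hypothetical placeholders for illustration only
per_city = pd.DataFrame({
    'Provider_Name': [18, 107, 152],
    'City_Population': [5_000, 26_000, 120_000],
}, index=['ABBEVILLE', 'ABERDEEN', 'ABILENE'])

# Providers per 10,000 residents: low values flag cities where
# patients have few providers to choose from (higher assumed risk)
per_city['Providers_per_10k'] = (
    per_city['Provider_Name'] / per_city['City_Population'] * 10_000
)
highest_risk_city = per_city['Providers_per_10k'].idxmin()
```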

Feature 3

4.3. Average cost per procedure/treatment

Assuming that people/hospitals are more likely to cheat on the most expensive procedures, this feature lets us focus on the ones with the highest likelihood of fraud.

In [61]:
df['Average_Cost_Per_Procedure'] = df['Average_Total_Payments']/df['Total_Discharges']
df.head(2)
Out[61]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474
In [62]:
fig = plt.figure(figsize=(15,7))
plt.boxplot(df['Average_Cost_Per_Procedure'], vert=False)
plt.title('Average Cost Per Procedure')
plt.xlabel("Cost in $")
plt.tight_layout()

Feature 4

4.4. Average Medicare Payments, converted to % of Average Total Payments

The Medicare % paid varies by state, hospital, and procedure. This feature will allow us to determine which hospitals, treatments, and procedures are viewed favorably by Medicare.

In [63]:
df['Medicare_%_Paid'] = df['Average_Medicare_Payments']/df['Average_Total_Payments']
df.head(2)
Out[63]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722
In [64]:
fig = plt.figure(figsize=(15,7))
sns.boxplot(df['Medicare_%_Paid'])
plt.title('Percentage of Average Total Payments Paid by Medicare (Average)')
plt.tight_layout()

Observation:

Most average Medicare payments are around 80-90% of the average total payments.

Feature 5

4.5. Medicare Payments, converted to % of Average Total Payments, State-wise

I am trying to see whether any states have Medicare payout averages of 99% or greater. The hypothesis is that those states are more attractive for fraud.

In [65]:
medicare_pct_state = df.groupby('Provider_State').agg({'Medicare_%_Paid': 'mean'}).reset_index()
medicare_pct_state.head(5)
Out[65]:
Provider_State Medicare_%_Paid
0 AK 0.871982
1 AL 0.816622
2 AR 0.834876
3 AZ 0.842718
4 CA 0.885084
In [66]:
df = pd.merge(left = df, right = medicare_pct_state, left_on = 'Provider_State', right_on = 'Provider_State',
              how = 'left')

df.rename(columns = {'Medicare_%_Paid_x':'Medicare_%_Paid',
                     'Medicare_%_Paid_y':'Medicare_%_Paid_State'}, inplace=True)

df.head(2)
Out[66]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622
In [67]:
fig = px.scatter(df, x="Provider_State", y="Medicare_%_Paid_State", width=1000, height=500,
             color = "Provider_State",
            title = "Medicare % Paid Distribution (by State)")
fig.show()
In [68]:
#fig = plt.figure(figsize=(15,7))
#sns.distplot(df['Medicare_%_Paid_State'])
#plt.title('Medicare % Paid Distribution (by State)')
#plt.tight_layout()

Feature 6

4.6. Out-of-pocket expense, difference between 'Average_Total_Payments' & 'Average_Medicare_Payments'

Out-of-pocket costs highlight the procedures/treatments that are most expensive for the patient. The hypothesis is that the procedures with the highest out-of-pocket cost are the least likely to be a target for fraud.

In [69]:
df['Out_of_Pocket_Payment'] = df['Average_Total_Payments'] - df['Average_Medicare_Payments']
df.head(2)
Out[69]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26
In [70]:
sorted_avg_out_of_pocket = df.groupby(['DRG_Definition']).agg({'Out_of_Pocket_Payment': 'mean'})
sorted_avg_out_of_pocket.sort_values(by = 'Out_of_Pocket_Payment',ascending=False).head()
Out[70]:
Out_of_Pocket_Payment
DRG_Definition
460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 3735.070150
473 - CERVICAL SPINAL FUSION W/O CC/MCC 2594.714232
247 - PERC CARDIOVASC PROC W DRUG-ELUTING STENT W/O MCC 2582.521719
207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT 96+ HOURS 2559.372528
853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC 2497.221490
In [71]:
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_Payment'])
plt.title('Out of Pocket Payment Distribution')
plt.tight_layout()

Feature 7

4.7. Out-of-Pocket expense per discharge

Hospitals with high out-of-pocket costs per discharge can help narrow down the ones unattractive to fraudsters.

In [72]:
df['Out_of_Pocket_per_discharge'] = df['Out_of_Pocket_Payment']/df['Total_Discharges']
df.head(2)
Out[72]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789

2 rows × 21 columns

In [73]:
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_per_discharge'])
plt.tight_layout()

Feature 8

4.8. Total treatments per state

The nominal number of treatments per state can be a misleading figure, since the most populous states will bubble to the top. This feature is more useful when compared against the population.

In [74]:
patients_states = df['Provider_State'].value_counts()
patients_states = pd.DataFrame(patients_states).reset_index()
patients_states.columns = ['Provider_State','Count']
patients_states.head()
Out[74]:
Provider_State Count
0 CA 13064
1 TX 11864
2 FL 11155
3 NY 9178
4 IL 7909
In [75]:
df = pd.merge(left = df, right = patients_states, left_on = 'Provider_State', right_on = 'Provider_State',
              how = 'left')

df.rename(columns = {'Count':'State_Total'}, inplace=True)

df.head(2)
Out[75]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635

2 rows × 22 columns

In [76]:
fig = px.scatter(df, x="Provider_State", y="State_Total", width=1000, height=500,
             color = "Provider_State",
            title = "Total Procedures/Treatments  per state")
fig.show()
In [77]:
#fig = plt.figure(figsize=(15,7))
#sns.distplot(df['State_Total'])
#plt.tight_layout()

Feature 9

4.9. Average treatments/patients by State (mean values, as grouped by State)

States have different norms and rules. This feature allows us to capture the normal state of each. The result can also be used to compare each state against the overall mean.

In [78]:
patient_avg_state = df.groupby('Provider_State').mean()[['Total_Discharges', 
                                                         'Average_Total_Payments',
                                                         'Average_Medicare_Payments']].reset_index()
patient_avg_state.head()
Out[78]:
Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments
0 AK 26.588745 14572.391732 12958.969437
1 AL 39.258322 7568.232149 6418.007120
2 AR 41.978229 8019.248805 6919.720832
3 AZ 36.690284 10154.528211 8825.717240
4 CA 36.357854 12629.668472 11494.381678
In [79]:
patient_avg_state.loc[:,'Total_Discharges':'Average_Medicare_Payments'].corr()
Out[79]:
Total_Discharges Average_Total_Payments Average_Medicare_Payments
Total_Discharges 1.000000 -0.124043 -0.060745
Average_Total_Payments -0.124043 1.000000 0.991735
Average_Medicare_Payments -0.060745 0.991735 1.000000
In [80]:
fig = plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
plt.boxplot(patient_avg_state['Total_Discharges'])
plt.title('Total Discharge Boxplot')
plt.xlabel('')

plt.subplot(2, 2, 3)
plt.boxplot(patient_avg_state['Average_Total_Payments'])
plt.title('Average Total Payment Boxplot')
plt.xlabel('')

plt.subplot(2, 2, 4)
plt.boxplot(patient_avg_state['Average_Medicare_Payments'])
plt.title('Average Medicare Payment Boxplot')
plt.tight_layout()
plt.show()
In [81]:
# States with highest discharges

patient_avg_state.sort_values(by = 'Total_Discharges', ascending = False).head()
Out[81]:
Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments
8 DE 67.901015 10360.072411 8959.673274
22 MI 54.539952 9754.420406 8662.157756
31 NJ 52.052839 10678.988647 9586.940056
20 MD 51.955255 12608.947664 11480.121829
27 NC 51.043841 9089.435711 7998.649702
In [82]:
# States with highest Average Total Payments

patient_avg_state.sort_values(by = 'Average_Total_Payments', ascending = False).head()
Out[82]:
Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments
0 AK 26.588745 14572.391732 12958.969437
7 DC 43.954545 12998.029416 11811.967706
11 HI 26.497738 12775.739525 10967.475045
4 CA 36.357854 12629.668472 11494.381678
20 MD 51.955255 12608.947664 11480.121829

Note:

These new mean columns are useful, but I believe median columns will be handier, so I will not add the mean columns to the original dataset yet.

Feature 10

4.10. Calculate the median of average total payment amount by DRG, by state.

'Median Average Total Payment'

To catch payments for treatments that exceed a normal amount, I will first calculate, for each DRG code at the state level, the median of the average total payments. Each row's payment can then be sized against the median for its DRG and state.

In [83]:
median_drg_state = df.groupby(['DRG_Definition','Provider_State'])['Average_Total_Payments'].median().reset_index()
median_drg_state.head()
Out[83]:
DRG_Definition Provider_State Average_Total_Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AK 8401.95
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 5658.33
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AR 5890.00
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AZ 6959.89
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CA 7863.14
In [84]:
df = pd.merge(left = df, right = median_drg_state, left_on = ['DRG_Definition','Provider_State'], right_on = ['DRG_Definition','Provider_State'], how = 'left')
df.rename(columns={'Average_Total_Payments_y':'Median_Avg_Total_Pyt',
                   'Average_Total_Payments_x':'Average_Total_Payments'}, inplace=True)
df.head(2)
Out[84]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89

2 rows × 23 columns

In [85]:
# Check for one particular state and one particular DRG, to see median

df[(df['Provider_State'] == 'NV') & (df['DRG_Definition'] == '194 - SIMPLE PNEUMONIA & PLEURISY W CC')].head(2)
Out[85]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt
89303 194 - SIMPLE PNEUMONIA & PLEURISY W CC 290001 RENOWN REGIONAL MEDICAL CENTER 1155 MILL STREET RENO NV 89502 NV - Reno 100 26254.72 ... Mountain 38624.424 45159.0 66.807600 0.774291 0.827003 1507.91 15.079100 1202 6990.625
89400 194 - SIMPLE PNEUMONIA & PLEURISY W CC 290003 SUNRISE HOSPITAL AND MEDICAL CENTER 3186 S MARYLAND PKWY LAS VEGAS NV 89109 NV - Las Vegas 80 62134.62 ... Mountain 37456.017 9490.0 103.280125 0.765924 0.827003 1934.03 24.175375 1202 6990.625

2 rows × 23 columns

Feature 11

4.11. Creating a common median multiple score for Average Total Payments

Explain a potential fraud case

'Median Score'

I will now divide the average total payment by the median payment to generate a simple score that indicates how many times larger the current payment amount is than the median amount.

In [86]:
df['Median_Score'] = df['Average_Total_Payments']/df['Median_Avg_Total_Pyt']
df.head(2)
Out[86]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107

2 rows × 24 columns

In [87]:
# fig = plt.figure(figsize=(10,5))
# sns.distplot(df['Median_Score'])
# plt.tight_layout()
In [88]:
fig = plt.figure(figsize=(15,5))
sns.boxplot(df['Median_Score'])
plt.tight_layout()
In [89]:
df['Median_Score'].describe()
Out[89]:
count    163065.000000
mean          1.050746
std           0.211465
min           0.517695
25%           0.925511
50%           1.000000
75%           1.112126
max           9.338775
Name: Median_Score, dtype: float64

Observation:

It would appear that most treatment payment amounts for the same DRG within the same state fall within 90% to 110% of the median price. This is expected, as doctors should normally charge similar prices for similar treatments in similar areas. However, we see instances where the payment amount is many times the median.

As we see in the box plot above, in two specific cases treatment cost over 9 times the median amount.

Let us investigate further.

Below are the cases where the payment made was over 6 times the median amount.

In [90]:
df[df['Median_Score'] >= 6]
Out[90]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score
69869 203 - BRONCHITIS & ASTHMA W/O CC/MCC 220008 STURDY MEMORIAL HOSPITAL 211 PARK STREET ATTLEBORO MA 02703 RI - Providence 11 7965.18 ... 67397.7900 41916.0 3771.099091 0.043155 0.872525 39691.91 3608.355455 3842 4441.92 9.338775
125467 189 - PULMONARY EDEMA & RESPIRATORY FAILURE 390096 ST JOSEPH MEDICAL CENTER 2500 BERNVILLE ROAD READING PA 19605 PA - Reading 143 24542.94 ... 56478.8113 16695.0 509.076434 0.106021 0.843987 65079.84 455.103776 7804 7956.00 9.150067
130307 948 - SIGNS & SYMPTOMS W/O MCC 390312 CANCER TREATMENT CENTERS OF AMERICA 1331 EAST WYOMING AVENUE PHILADELPHIA PA 19124 PA - Philadelphia 24 83945.95 ... 29285.1249 62905.0 1207.008333 0.307033 0.843987 20074.00 836.416667 7804 4476.05 6.471822
156606 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... 500051 OVERLAKE HOSPITAL MEDICAL CENTER 1035-116TH AVE NE BELLEVUE WA 98004 WA - Seattle 23 44499.00 ... 99669.2313 27946.0 3673.880870 0.100600 0.841795 75998.66 3304.289565 2778 12850.13 6.575751

4 rows × 24 columns

Observation:

The highest median score is 9.33 which is for:

  • index number - 69869
  • DRG - 203 - BRONCHITIS & ASTHMA W/O CC/MCC
  • City - ATTLEBORO
  • State - MA

This individual was charged about USD 41,482 for a treatment with a median payment amount of just USD 4,441.92. This particular treatment was performed at Provider - STURDY MEMORIAL HOSPITAL.

Now, let's examine this particular hospital more closely.

In [91]:
suspect_hospital1 = df[df['Provider_Name'] == 'STURDY MEMORIAL HOSPITAL']['Median_Score']

fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital1)
plt.tight_layout()

print(" ")
print("Median Score distribution for Provider - STURDY MEMORIAL HOSPITAL is as follows")
print(" ")
print(suspect_hospital1.describe())
 
Median Score distribution for Provider - STURDY MEMORIAL HOSPITAL is as follows
 
count    69.000000
mean      1.027317
std       1.016937
min       0.786568
25%       0.873115
50%       0.903796
75%       0.928759
max       9.338775
Name: Median_Score, dtype: float64

Observation:

The distribution is striking. The hospital typically charges less than the median amount for its treatments, as the bulk of observations fall below a median score of 1. This makes the treatment with a median multiple score of over 9 highly unusual.

Now, another hospital with a treatment that cost far more than the median amount is 'ST JOSEPH MEDICAL CENTER':

  • index number - 125467
  • DRG - 189 - PULMONARY EDEMA & RESPIRATORY FAILURE
  • City - READING
  • State - PA

Let's examine this hospital's track record too.

In [92]:
suspect_hospital2 = df[df['Provider_Name'] == 'ST JOSEPH MEDICAL CENTER']['Median_Score']

fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital2)
plt.tight_layout()

print(" ")
print("Median Score distribution for Provider - ST JOSEPH MEDICAL CENTER is as follows")
print(" ")
print(suspect_hospital2.describe())
 
Median Score distribution for Provider - ST JOSEPH MEDICAL CENTER is as follows
 
count    427.000000
mean       1.060935
std        0.424824
min        0.720824
25%        0.933857
50%        1.007073
75%        1.099725
max        9.150067
Name: Median_Score, dtype: float64

Observation:

Again, this is a hospital that typically charges reasonable prices for treatment; most observations fall below a median score of 2. But that makes the one treatment at over 9 times the median price an extreme outlier deserving of extra attention.

Making the broad assumption that around 1% of medical payments are fraudulent, I will tag any medical treatment whose payment exceeds the 99th percentile of median scores in the dataset.

In [93]:
np.percentile(df['Median_Score'], 99)
Out[93]:
1.790032745554255

1% of medical treatments cost more than 1.79 times the state median payment for that treatment. Treatments that paid more than this will be flagged.

Feature 12

4.12. Boolean flag, based on the Median Score, for rows in the top 1% (99th percentile) of the data.

'Median Score Flag'

I will flag every row whose Median_Score is at or above the 99th percentile computed above, i.e. the roughly 1% of payments assumed to be potentially fraudulent.

In [94]:
df['Median_Score_Flag'] = df['Median_Score'] >= np.percentile(df['Median_Score'], 99)
df.head(2)
Out[94]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False

2 rows × 25 columns

This approach is good for finding providers that overcharge substantially. However, it is less useful for finding hospitals that overcharge slightly but across many treatments. One way to find those providers is to look at which ones have the highest average median score.

Feature 13

4.13. Calculate the Median Score at the individual Provider level; previously we computed it at the individual row level.

'Median Score by Provider'

To catch providers that overcharge consistently rather than on a single treatment, I will average the Median_Score across all rows of each provider.

In [95]:
Median_Score_by_Provider = df.groupby(['Provider_Name']).mean()['Median_Score'].reset_index()
print(Median_Score_by_Provider.head(2))

fig = plt.figure(figsize=(10,5))
sns.distplot(Median_Score_by_Provider['Median_Score'])
plt.tight_layout()

print(" ")
print("Median Score distribution across all Providers is as follows")
print(" ")
print(Median_Score_by_Provider.describe())
                  Provider_Name  Median_Score
0    ABBEVILLE GENERAL HOSPITAL      1.033378
1  ABBOTT NORTHWESTERN HOSPITAL      1.056686
 
Median Score distribution across all Providers is as follows
 
       Median_Score
count   3201.000000
mean       1.043923
std        0.200934
min        0.635737
25%        0.927984
50%        0.995734
75%        1.096960
max        3.777786

Some providers consistently charge multiple times the median payment amount. However, before we blow the whistle on these hospitals, we have to consider that they may be charging more than their in-state counterparts because they are high-end/luxury hospitals and/or located in expensive cities. Let's see which hospitals these are.

In [96]:
df = pd.merge(left = df, right = Median_Score_by_Provider, left_on = 'Provider_Name', 
              right_on = 'Provider_Name', how = 'left')

df.rename(columns={'Median_Score_x':'Median_Score',
                   'Median_Score_y':'Median_Score_by_Provider'}, inplace=True)
df.head(2)
Out[96]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344

2 rows × 26 columns

Display the hospitals that, on average, charge more than double the median price for the treatments they provide compared to others within their state.

I had expected these hospitals to either be luxury hospitals or located in expensive cities.

In [97]:
df[df['Median_Score_by_Provider'] >= 2][['Provider_Name', 'Provider_State','Provider_City']].drop_duplicates()
Out[97]:
Provider_Name Provider_State Provider_City
15017 CONTRA COSTA REGIONAL MEDICAL CENTER CA MARTINEZ
19110 MEMORIAL HOSPITAL LOS BANOS CA LOS BANOS
23176 KEEFE MEMORIAL HOSPITAL CO CHEYENNE WELLS
23353 VAIL VALLEY MEDICAL CENTER CO VAIL
27595 JACKSON MEMORIAL HOSPITAL FL MIAMI
46304 MIDWESTERN REGION MED CENTER IL ZION
96910 GUADALUPE COUNTY HOSPITAL NM SANTA ROSA
130305 CANCER TREATMENT CENTERS OF AMERICA PA PHILADELPHIA
138799 UNIVERSITY OF TEXAS MEDICAL BRANCH GAL TX GALVESTON
159340 WELCH COMMUNITY HOSPITAL WV WELCH

Feature 14

4.14. Create a boolean benchmark for hospitals that overcharge

'Provider Flag by Median Score'

I set the benchmark at a median score of 2 or higher. This avoids flagging hospitals merely because of city-level cost differences: even in the most expensive cities, the average cost of treatment should never be double the average cost of treatment outside the city within the same state.

In [98]:
df['Provider_Flag_by_Median_Score'] = df['Median_Score_by_Provider'] >= 2
df.head(2)
Out[98]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344 False
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344 False

2 rows × 27 columns

Below is the list of providers that are flagged because their treatment payments are, on average, more than double the median amount for their state.

In [99]:
df[df['Provider_Flag_by_Median_Score']][['Provider_Id','Provider_Name','Provider_City',
                                         'Provider_State','Median_Score_by_Provider']].drop_duplicates()
Out[99]:
Provider_Id Provider_Name Provider_City Provider_State Median_Score_by_Provider
15017 50276 CONTRA COSTA REGIONAL MEDICAL CENTER MARTINEZ CA 2.254486
19110 50528 MEMORIAL HOSPITAL LOS BANOS LOS BANOS CA 2.508609
23176 60043 KEEFE MEMORIAL HOSPITAL CHEYENNE WELLS CO 2.376034
23353 60096 VAIL VALLEY MEDICAL CENTER VAIL CO 2.030422
27595 100022 JACKSON MEMORIAL HOSPITAL MIAMI FL 2.173011
46304 140100 MIDWESTERN REGION MED CENTER ZION IL 2.861803
96910 320067 GUADALUPE COUNTY HOSPITAL SANTA ROSA NM 2.187486
130305 390312 CANCER TREATMENT CENTERS OF AMERICA PHILADELPHIA PA 3.777786
138799 450018 UNIVERSITY OF TEXAS MEDICAL BRANCH GAL GALVESTON TX 2.336738
159340 510086 WELCH COMMUNITY HOSPITAL WELCH WV 2.085069

The hospital with the highest score is CANCER TREATMENT CENTERS OF AMERICA, at about 3.78.

As an example, let us examine MEMORIAL HOSPITAL LOS BANOS, which I assume constantly overcharges on procedures that are much cheaper in other hospitals within the same state. Let's see it below.

In [100]:
df[df['Provider_Name'] == 'MEMORIAL HOSPITAL LOS BANOS'][['Median_Score']]
Out[100]:
Median_Score
19110 2.624709
19111 2.473714
19112 2.490096
19113 2.748748
19114 2.305088
19115 2.489811
19116 2.471841
19117 2.464865

Observation:

As we can see, this is an effective method to track down hospitals that overcharge or commit fraud.

Feature 15

4.15. Sum of 'Total_Discharges', grouped by Provider Id and Provider Name

I will check the grand total of the number of discharges made by each provider. The hypothesis is that hospitals with the highest number of discharges are more susceptible to fraud, e.g. false patients and fake claims.

In [101]:
discharges_provider = df.groupby(['Provider_Id','Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider = discharges_provider.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider.head()
Out[101]:
Provider_Id Provider_Name Grand Total of Discharges
599 100007 FLORIDA HOSPITAL 25828
1970 330101 NEW YORK-PRESBYTERIAN HOSPITAL 16834
2882 450388 METHODIST HOSPITAL 15921
583 80001 CHRISTIANA CARE HEALTH SERVICES, INC. 14542
1534 230130 WILLIAM BEAUMONT HOSPITAL 14469

Observation:

It is seen that the highest discharges are by 'Florida Hospital', at 25,828.

However, it would be better to also group the total discharges by Provider State for better inference.

In [102]:
discharges_provider_state = df.groupby(['Provider_State','Provider_Id', 'Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider_state = discharges_provider_state.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider_state.head()
Out[102]:
Provider_State Provider_Id Provider_Name Grand Total of Discharges
599 FL 100007 FLORIDA HOSPITAL 25828
2065 NY 330101 NEW YORK-PRESBYTERIAN HOSPITAL 16834
2882 TX 450388 METHODIST HOSPITAL 15921
590 DE 80001 CHRISTIANA CARE HEALTH SERVICES, INC. 14542
1534 MI 230130 WILLIAM BEAUMONT HOSPITAL 14469

Feature 16

4.16. Ratio of 'Average Total Payments' to 'Zip_Median_Income'

'Avg_Payment_by_Median_Income'

This ratio gives an idea of whether the average payments are high or low relative to the median income of the population. The hypothesis is that if the ratio is high, patients in that zip code are paying far more than their median income for treatment, which might indicate fraud, as the patient/hospital might have inflated the bills.

Even the same hospital may have a lower ratio for a particular DRG or State but a higher one for another, so I will not group the data; I need an individual ratio for each row entry.

In [103]:
df['Avg_Payment_by_Median_Income'] = df['Average_Total_Payments'] / df['Zip_Median_Income']
df.head(2)
Out[103]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score Avg_Payment_by_Median_Income
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344 False 0.152001
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344 False 0.128783

2 rows × 28 columns

In [104]:
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Avg_Payment_by_Median_Income"])
plt.title("Distribution of Average Total Payments by Zipcode Median Income", fontsize=20)
Out[104]:
Text(0.5, 1.0, 'Distribution of Average Total Payments by Zipcode Median Income')

The entries with a ratio over 5 might need some investigation.

Feature 17

4.17. Ratio of 'Total Discharges' to 'Zip_Population'

'Total_Disc_by_Pop'

Hospitals with a very high ratio are likely to be showing false patients; those might need further investigation.

Even the same hospital may have a lower ratio for a particular DRG or State but a higher one for another, so I will not group the data; I need an individual ratio for each row entry.

In [105]:
df['Total_Disc_by_Pop'] = df['Total_Discharges'] / df['Zip_Population']
df.head(2)
Out[105]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score Avg_Payment_by_Median_Income Total_Disc_by_Pop
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344 False 0.152001 0.002545
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 20312.78 ... 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344 False 0.128783 0.001063

2 rows × 29 columns

In [106]:
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Total_Disc_by_Pop"])
plt.title("Distribution of Total Discharges by Zipcode Population", fontsize=20)
Out[106]:
Text(0.5, 1.0, 'Distribution of Total Discharges by Zipcode Population')

The ratios over 10 might need investigation.

.

-------------------------------------- SECTION BREAK --------------------------------------

.

5. Modelling

I will create a new copy of the data to perform the analysis, and call it 'df1'.

In [107]:
# Create a copy of the original dataset

df1 = df.copy()
In [108]:
# Check nulls

df1.isnull().sum().sum()
Out[108]:
48452

These NAs are created as a result of merging the Income and Population data by zip code. There are certain zip codes in our dataset which do not have an entry in the zip-code dataset.

So, I will replace the NA's with the median of the respective column.

Impute missing Income and Population values with respective median

In [109]:
df1 = df1.fillna(df1.median(numeric_only=True))  # median of numeric columns only
In [110]:
df1.isnull().sum().sum()
Out[110]:
0

Drop irrelevant columns

There are a lot of columns which will not be useful in the clustering analysis, most of them categorical or identifier columns. I will drop them.

In [111]:
df1 = df1.drop(columns = ['Provider_Id', 'Provider_Name', 'Provider_Street_Address', 
                        'Provider_City', 'Provider_Zip_Code', 
                        'Hospital_Referral_Region_Description', 
                        'Region', 'Total_Discharges',
                        'Division', 'State_Total', 'Zip_Median_Income',
                        'Zip_Population', 'Median_Score_Flag',
                        'Provider_Flag_by_Median_Score'])
df1.head(2)
Out[111]:
DRG_Definition Provider_State Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge Median_Avg_Total_Pyt Median_Score Median_Score_by_Provider Avg_Payment_by_Median_Income Total_Disc_by_Pop
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 32963.07 5777.24 4763.73 63.486154 0.824568 0.816622 1013.51 11.137473 5658.33 1.021015 1.019344 0.152001 0.002545
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... AL 20312.78 4894.76 3865.50 128.809474 0.789722 0.816622 1029.26 27.085789 4913.89 0.996107 1.019344 0.128783 0.001063

Categorical Variables

I will drop the categorical columns, since I have already used them to create meaningful new features.

Label encoding would be misleading for clustering, and one-hot encoding creates many new binary features, which is not ideal for distance-based methods such as kNN and PCA.

In [112]:
df1 = df1.drop(columns = ['DRG_Definition', 'Provider_State'])
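To illustrate the dimensionality concern, a quick sketch of what one-hot encoding would do to a DRG-like column (toy labels, not the real data):

```python
import pandas as pd

# Toy column standing in for DRG_Definition: each distinct label
# becomes its own binary column under one-hot encoding.
drg = pd.Series(['039 - EXTRACRANIAL', '057 - DEGENERATIVE',
                 '064 - INTRACRANIAL', '039 - EXTRACRANIAL'],
                name='DRG_Definition')

one_hot = pd.get_dummies(drg)
print(one_hot.shape)   # (4, 3): 3 distinct labels -> 3 new binary columns
```

With roughly 100 DRGs and 50-plus states, the real data would gain well over a hundred sparse binary columns, which distorts Euclidean distances.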

Create an alias for the numerical variables

In [113]:
# All numeric / float variables in the dataset

num_variables_all = ['Average_Covered_Charges', 'Average_Total_Payments', 'Average_Medicare_Payments',
                'Average_Cost_Per_Procedure', 'Medicare_%_Paid',
       'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
       'Out_of_Pocket_per_discharge', 'Median_Avg_Total_Pyt',
       'Median_Score', 'Median_Score_by_Provider',
       'Avg_Payment_by_Median_Income',
       'Total_Disc_by_Pop']

Check multi-collinearity

Drop one column from each pair with an absolute correlation above 0.7

In [114]:
corr = df1.corr()
f,ax = plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f', ax=ax)
plt.show()
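Rather than reading the heatmap by eye, the same 0.7 rule can be applied programmatically; a small helper sketch (demo data, not the assignment's frame):

```python
import numpy as np
import pandas as pd

# List variable pairs whose absolute correlation exceeds a threshold.
def collinear_pairs(frame, threshold=0.7):
    corr = frame.corr().abs()
    # keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return (upper.stack()
                 .loc[lambda s: s > threshold]
                 .sort_values(ascending=False))

# Demo frame: a and b are nearly identical, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})
print(collinear_pairs(demo))   # only the ('a', 'b') pair is reported
```

Applied to `df1`, this would list exactly the pairs from which one column was dropped above.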
In [115]:
# Remove one each from the pair of highly collinear variables

df1 = df1.drop(columns = ['Average_Medicare_Payments', 'Average_Covered_Charges',
                          'Median_Avg_Total_Pyt', 'Out_of_Pocket_per_discharge',
                        'Average_Cost_Per_Procedure', 'Median_Score_by_Provider',
                       'Avg_Payment_by_Median_Income'])
In [116]:
# Final numeric variables selected

num_variables = ['Average_Total_Payments',
                'Medicare_%_Paid',
       'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
       'Median_Score',
       'Total_Disc_by_Pop']

Scaling/Standardizing all float or integer variables

I will use StandardScaler to standardize all the numerical variables.

In [117]:
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
In [118]:
X.isnull().sum().sum()
Out[118]:
0

5.1. Autoencoder clustering

Split into Train Test, to perform the Autoencoder clustering

For this, I will first split my data to train and test. This is essential to train the model.

Note:

Even the train data has many outliers, so instead of scoring only the smaller test set, I will score the entire dataset (X) in place of the test data.

In [119]:
X.shape[0] * 0.80
Out[119]:
130452.0
In [120]:
X_train = X.iloc[:130452,:] 
X_test = X.iloc[130452:,:] 
print("Shape of new dataframes - {} , {}".format(X_train.shape, X_test.shape))
Shape of new dataframes - (130452, 6) , (32613, 6)

Autoencoder Model

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.
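Under the hood, an autoencoder scores each row by its reconstruction error: how far the decoded output lands from the input. A minimal linear stand-in (an SVD "bottleneck", not PyOD's actual network) shows why an off-pattern row gets a high score:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 normal points lying close to the line y = x, plus one off-axis outlier
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 1.0]])
X += rng.normal(scale=0.05, size=X.shape)
X = np.vstack([X, [[3.0, -3.0]]])          # the outlier breaks the y = x pattern

# A linear "autoencoder": encode onto the top principal direction, decode back.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
encode = Vt[:1].T                           # 2 -> 1 bottleneck
recon = Xc @ encode @ encode.T + X.mean(axis=0)

# Anomaly score = reconstruction error per row
scores = np.linalg.norm(X - recon, axis=1)
print(scores.argmax())                      # the last row (the outlier) scores highest
```

The normal points survive the bottleneck almost untouched, while the outlier, which does not fit the learned pattern, reconstructs poorly and receives a large score; PyOD's `decision_function` applies the same idea with a nonlinear network.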

Plot in a 2-D space

In order to give a good sense of what the data looks like, I use PCA to reduce it to two dimensions and plot the result.

In [121]:
pca = PCA(2)
x_pca = pca.fit_transform(X_train)
x_pca = pd.DataFrame(x_pca)
x_pca.columns=['PC1','PC2']
x_pca.head()
Out[121]:
PC1 PC2
0 -1.177011 0.473709
1 -1.477861 0.653917
2 -0.494731 0.063673
3 -1.039397 0.509059
4 -1.486228 0.669817
In [122]:
# Plot
f,ax = plt.subplots(figsize=(8,5))
plt.scatter(x = x_pca['PC1'], y = x_pca['PC2'], alpha = 0.3)
plt.title('Scatter plot of PC1 and PC2')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

The lighter, isolated points may be outliers or anomalies.

Now, I will build 3 different models using three different hyperparameter settings, which will help build a stable model.

5.1.1. Model 1

The input and output layers need 6 neurons each, as there are 6 variables.

In [123]:
# Train the AutoEncoder detector

clf1 = AutoEncoder(hidden_neurons =[6, 5, 5, 6])
clf1.fit(X_train)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 6)                 42        
_________________________________________________________________
dropout (Dropout)            (None, 6)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 42        
_________________________________________________________________
dropout_1 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 42        
_________________________________________________________________
dropout_2 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 35        
_________________________________________________________________
dropout_3 (Dropout)          (None, 5)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 5)                 30        
_________________________________________________________________
dropout_4 (Dropout)          (None, 5)                 0         
_________________________________________________________________
dense_5 (Dense)              (None, 6)                 36        
_________________________________________________________________
dropout_5 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 6)                 42        
=================================================================
Total params: 269
Trainable params: 269
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
3669/3669 [==============================] - 4s 1ms/step - loss: 1.1308 - val_loss: 0.9095
Epoch 2/100
3669/3669 [==============================] - 4s 981us/step - loss: 1.0259 - val_loss: 0.8945
Epoch 3/100
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0148 - val_loss: 0.8917
Epoch 4/100
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0130 - val_loss: 0.8909
Epoch 5/100
3669/3669 [==============================] - 3s 909us/step - loss: 1.0126 - val_loss: 0.8906
Epoch 6/100
3669/3669 [==============================] - 3s 849us/step - loss: 1.0124 - val_loss: 0.8905
Epoch 7/100
3669/3669 [==============================] - 3s 830us/step - loss: 1.0123 - val_loss: 0.8905
Epoch 8/100
3669/3669 [==============================] - 4s 963us/step - loss: 1.0122 - val_loss: 0.8904
... (epochs 9-99 truncated: loss stays at 1.0122, val_loss at 0.8904) ...
Epoch 100/100
3669/3669 [==============================] - 3s 850us/step - loss: 1.0122 - val_loss: 0.8904
Out[123]:
AutoEncoder(batch_size=32, contamination=0.1, dropout_rate=0.2, epochs=100,
      hidden_activation='relu', hidden_neurons=[6, 5, 5, 6],
      l2_regularizer=0.1,
      loss=<function mean_squared_error at 0x7f7edf693b90>,
      optimizer='adam', output_activation='sigmoid', preprocessing=True,
      random_state=None, validation_size=0.1, verbose=1)

Generate the anomaly score using clf.decision_function and visualize

  • "decision_function()" returns the anomaly score for each row of a dataframe. A higher score means more abnormal.
  • The histogram below shows there are outliers. Once a cut point is chosen, rows with scores at or above it can be flagged as outliers.

I will use the entire dataset (X) as the test data.

This is because even the train dataset contains outliers, and I need to identify outliers across the entire dataset.

In [124]:
y_train_scores = clf1.decision_scores_ 
y_train_scores
Out[124]:
array([1.82483136, 1.97452885, 1.81487235, ..., 1.49876268, 0.95353885,
       0.9863651 ])
In [125]:
# get the prediction on the test data
y_test_pred = clf1.predict(X)  # outlier labels (0 or 1)

# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
y_test_scores = clf1.decision_function(X)  # outlier scores

y_test_pred = pd.Series(y_test_pred)
y_test_scores = pd.Series(y_test_scores)
In [126]:
y_test_pred.value_counts()
Out[126]:
0    146273
1     16792
dtype: int64

Plotting the Scores

In [127]:
# Plot it!
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores, bins=150)  
plt.title("Histogram for Model Autoencoder Clf1, Anomaly Scores, with 150 bins")
plt.xlabel('Score')
plt.show()

In the above graph it is very difficult to see the distribution, as the range is very wide. So, I will cap the range at 10.0 and then re-plot.

In [128]:
y_test_scores_small = y_test_scores[y_test_scores < 10.0]
# y_test_scores_small
In [129]:
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores_small, bins='auto')  
plt.title("Histogram for Model Autoencoder Clf1, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()

Observation:

A high anomaly score means a more abnormal record. The histogram above shows there are outliers. I will now choose a range of cut points, check the distribution of the resulting clusters, and then identify some as suspicious.

Reasonable Boundary

Generally, looking at the graph, 6.0 seems to be the ideal cut point, or reasonable boundary: if we choose 6.0, rows with scores >= 6.0 can be flagged as outliers.

But I want to investigate deeper, so I will choose 2 different cut points, which are:

  • 6.0
  • 10.0
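As a cross-check on these hand-picked boundaries, a quantile rule gives a data-driven cut point; a sketch with synthetic scores (not the model's actual scores):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic anomaly scores: a bulk of ordinary points plus a heavy tail
scores = np.concatenate([rng.gamma(shape=2.0, scale=1.0, size=9900),
                         rng.uniform(8.0, 20.0, size=100)])

# Flag the top 1% as outliers instead of eyeballing the histogram
cut = np.quantile(scores, 0.99)
flagged = scores >= cut
print(round(flagged.mean(), 3))            # about 0.01 of rows flagged
```

A quantile cut point ties the flagged fraction directly to an assumed contamination rate (1% here), which is the same role PyOD's `contamination` parameter plays.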

Now, let's evaluate these 2 different cut points and build a 3-cluster model.

Build a 3-cluster model as per the above reasonable boundaries

In [130]:
df_test = pd.DataFrame(X)
df_test['score'] = y_test_scores

df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
                                        (np.where(df_test['score'] > 6.0, 2, 1))))

df_test['cluster'].value_counts()
Out[130]:
1    160796
2      1746
3       523
Name: cluster, dtype: int64

Percentage of Data points in each cluster

In [131]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[131]:
Percentage of total Cluster
1 98.608530 1
2 1.070739 2
3 0.320731 3

I will provide the description and business insight later below, when I aggregate with the 'Average' method.

Summary statistics

In [132]:
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
Out[132]:
Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop score
cluster
1 -0.048213 0.015430 -0.003469 -0.063578 -0.032782 -0.022991 1.778916
2 3.363007 -0.833901 0.234229 3.424887 2.006283 0.233353 7.344035
3 3.596005 -1.959903 0.284640 8.113254 3.380891 6.289655 17.103574

5.1.2. Model 2

In [133]:
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
In [134]:
X_train = X.iloc[:130452,:]
In [135]:
# epochs = 20: the loss stabilizes after roughly the 16th epoch
clf2 = AutoEncoder(hidden_neurons =[6, 5, 2, 5, 6], epochs = 20)
clf2.fit(X_train)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_7 (Dense)              (None, 6)                 42        
_________________________________________________________________
dropout_6 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_8 (Dense)              (None, 6)                 42        
_________________________________________________________________
dropout_7 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_9 (Dense)              (None, 6)                 42        
_________________________________________________________________
dropout_8 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_10 (Dense)             (None, 5)                 35        
_________________________________________________________________
dropout_9 (Dropout)          (None, 5)                 0         
_________________________________________________________________
dense_11 (Dense)             (None, 2)                 12        
_________________________________________________________________
dropout_10 (Dropout)         (None, 2)                 0         
_________________________________________________________________
dense_12 (Dense)             (None, 5)                 15        
_________________________________________________________________
dropout_11 (Dropout)         (None, 5)                 0         
_________________________________________________________________
dense_13 (Dense)             (None, 6)                 36        
_________________________________________________________________
dropout_12 (Dropout)         (None, 6)                 0         
_________________________________________________________________
dense_14 (Dense)             (None, 6)                 42        
=================================================================
Total params: 266
Trainable params: 266
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0849 - val_loss: 1.0958
Epoch 2/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0010 - val_loss: 1.0815
Epoch 3/20
3669/3669 [==============================] - 3s 953us/step - loss: 0.9937 - val_loss: 1.0786
Epoch 4/20
3669/3669 [==============================] - 4s 994us/step - loss: 0.9921 - val_loss: 1.0779
Epoch 5/20
3669/3669 [==============================] - 4s 979us/step - loss: 0.9917 - val_loss: 1.0777
Epoch 6/20
3669/3669 [==============================] - 4s 1ms/step - loss: 0.9916 - val_loss: 1.0776
Epoch 7/20
3669/3669 [==============================] - 4s 1ms/step - loss: 0.9915 - val_loss: 1.0776
Epoch 8/20
3669/3669 [==============================] - 4s 1ms/step - loss: 0.9915 - val_loss: 1.0775
Epoch 9/20
3669/3669 [==============================] - 3s 953us/step - loss: 0.9914 - val_loss: 1.0775
... (epochs 10-19 truncated: loss stays at 0.9914, val_loss at 1.0775) ...
Epoch 20/20
3669/3669 [==============================] - 4s 1ms/step - loss: 0.9914 - val_loss: 1.0775
Out[135]:
AutoEncoder(batch_size=32, contamination=0.1, dropout_rate=0.2, epochs=20,
      hidden_activation='relu', hidden_neurons=[6, 5, 2, 5, 6],
      l2_regularizer=0.1,
      loss=<function mean_squared_error at 0x7f7edf693b90>,
      optimizer='adam', output_activation='sigmoid', preprocessing=True,
      random_state=None, validation_size=0.1, verbose=1)
In [136]:
y_train_scores = clf2.decision_scores_ 
y_train_scores
Out[136]:
array([1.82739647, 1.97734463, 1.81649377, ..., 1.50026115, 0.95358793,
       0.98745553])
In [137]:
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
y_test_scores = clf2.decision_function(X)  # outlier scores
y_test_scores = pd.Series(y_test_scores)

Plotting the Scores

In [138]:
# Plot it!
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores, bins=150)  
plt.title("Histogram for Model Autoencoder Clf2, Anomaly Scores, with 150 bins")
plt.xlabel('Score')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 10.0 and re-plot

In [139]:
y_test_scores_small = y_test_scores[y_test_scores < 10.0]
# y_test_scores_small
In [140]:
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores_small, bins='auto')  
plt.title("Histogram for Model Autoencoder Clf2, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()

Build a 3 cluster model

In [141]:
df_test = pd.DataFrame(X)
df_test['score'] = y_test_scores

df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
                                        (np.where(df_test['score'] > 6.0, 2, 1))))

df_test['cluster'].value_counts()
Out[141]:
1    160799
2      1743
3       523
Name: cluster, dtype: int64
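
The nested np.where above maps scores to cluster labels. The same thresholding can be written with pd.cut, shown here as a small sketch on hypothetical scores (not the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical anomaly scores standing in for the autoencoder's output.
scores = pd.Series([0.5, 6.5, 12.0, 1.8, 10.1])

# The nested np.where used in the notebook cell above:
cluster_np = np.where(scores > 10.0, 3, np.where(scores > 6.0, 2, 1))

# An equivalent, arguably clearer mapping with pd.cut:
cluster_cut = pd.cut(scores, bins=[-np.inf, 6.0, 10.0, np.inf],
                     labels=[1, 2, 3]).astype(int)

assert (cluster_np == cluster_cut.to_numpy()).all()
print(list(cluster_np))  # -> [1, 2, 3, 1, 3]
```

pd.cut keeps the bin edges in one place, which makes it easier to change the cut points later.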

Percentage of Data points in each cluster

In [142]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[142]:
Percentage of total Cluster
1 98.610370 1
2 1.068899 2
3 0.320731 3

I will provide the description and business insight later below, when I aggregate with the 'Average' method

Summary statistics

In [143]:
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
Out[143]:
Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop score
cluster
1 -0.048136 0.015455 -0.003452 -0.063578 -0.032743 -0.022992 1.779754
2 3.361720 -0.837751 0.233064 3.430939 2.006263 0.233808 7.344575
3 3.596005 -1.959903 0.284640 8.113254 3.380891 6.289655 17.101918

5.1.3. Model 3

In [144]:
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
In [145]:
X_train = X.iloc[:130452,:]
In [146]:
clf3 = AutoEncoder(hidden_neurons =[6, 5, 3, 2, 3, 5, 6], epochs = 20)
clf3.fit(X_train)
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_15 (Dense)             (None, 6)                 42        
_________________________________________________________________
dropout_13 (Dropout)         (None, 6)                 0         
_________________________________________________________________
dense_16 (Dense)             (None, 6)                 42        
_________________________________________________________________
dropout_14 (Dropout)         (None, 6)                 0         
_________________________________________________________________
dense_17 (Dense)             (None, 6)                 42        
_________________________________________________________________
dropout_15 (Dropout)         (None, 6)                 0         
_________________________________________________________________
dense_18 (Dense)             (None, 5)                 35        
_________________________________________________________________
dropout_16 (Dropout)         (None, 5)                 0         
_________________________________________________________________
dense_19 (Dense)             (None, 3)                 18        
_________________________________________________________________
dropout_17 (Dropout)         (None, 3)                 0         
_________________________________________________________________
dense_20 (Dense)             (None, 2)                 8         
_________________________________________________________________
dropout_18 (Dropout)         (None, 2)                 0         
_________________________________________________________________
dense_21 (Dense)             (None, 3)                 9         
_________________________________________________________________
dropout_19 (Dropout)         (None, 3)                 0         
_________________________________________________________________
dense_22 (Dense)             (None, 5)                 20        
_________________________________________________________________
dropout_20 (Dropout)         (None, 5)                 0         
_________________________________________________________________
dense_23 (Dense)             (None, 6)                 36        
_________________________________________________________________
dropout_21 (Dropout)         (None, 6)                 0         
_________________________________________________________________
dense_24 (Dense)             (None, 6)                 42        
=================================================================
Total params: 294
Trainable params: 294
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.1456 - val_loss: 0.9411
Epoch 2/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0218 - val_loss: 0.9210
Epoch 3/20
3669/3669 [==============================] - 6s 2ms/step - loss: 1.0115 - val_loss: 0.9179
Epoch 4/20
3669/3669 [==============================] - 5s 1ms/step - loss: 1.0099 - val_loss: 0.9172
Epoch 5/20
3669/3669 [==============================] - 6s 2ms/step - loss: 1.0096 - val_loss: 0.9169
Epoch 6/20
3669/3669 [==============================] - 4s 987us/step - loss: 1.0094 - val_loss: 0.9168
Epoch 7/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0094 - val_loss: 0.9167
Epoch 8/20
3669/3669 [==============================] - 6s 2ms/step - loss: 1.0093 - val_loss: 0.9167
Epoch 9/20
3669/3669 [==============================] - 5s 1ms/step - loss: 1.0093 - val_loss: 0.9167
Epoch 10/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0093 - val_loss: 0.9167
Epoch 11/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0093 - val_loss: 0.9167
Epoch 12/20
3669/3669 [==============================] - 4s 985us/step - loss: 1.0093 - val_loss: 0.9166
Epoch 13/20
3669/3669 [==============================] - 4s 994us/step - loss: 1.0093 - val_loss: 0.9166
Epoch 14/20
3669/3669 [==============================] - 5s 1ms/step - loss: 1.0093 - val_loss: 0.9166
Epoch 15/20
3669/3669 [==============================] - 5s 1ms/step - loss: 1.0093 - val_loss: 0.9166
Epoch 16/20
3669/3669 [==============================] - 6s 2ms/step - loss: 1.0093 - val_loss: 0.9166
Epoch 17/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0093 - val_loss: 0.9166
Epoch 18/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0093 - val_loss: 0.9166
Epoch 19/20
3669/3669 [==============================] - 4s 1ms/step - loss: 1.0093 - val_loss: 0.9166
Epoch 20/20
3669/3669 [==============================] - 4s 992us/step - loss: 1.0093 - val_loss: 0.9166
Out[146]:
AutoEncoder(batch_size=32, contamination=0.1, dropout_rate=0.2, epochs=20,
      hidden_activation='relu', hidden_neurons=[6, 5, 3, 2, 3, 5, 6],
      l2_regularizer=0.1,
      loss=<function mean_squared_error at 0x7f7edf693b90>,
      optimizer='adam', output_activation='sigmoid', preprocessing=True,
      random_state=None, validation_size=0.1, verbose=1)

EPOCHS: Here, I chose the value 20, as the loss stabilizes after roughly the 7th epoch
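
The "loss stabilizes" judgment made by eye above can also be expressed in code. A minimal sketch, where plateau_epoch is an illustrative helper (not part of the notebook) and the history dict mimics the val_loss values from the clf3 training log:

```python
def plateau_epoch(val_loss, tol=1e-3):
    """Return the first (1-based) epoch whose improvement over the
    previous epoch falls below tol, i.e. where the loss plateaus."""
    for i in range(1, len(val_loss)):
        if val_loss[i - 1] - val_loss[i] < tol:
            return i + 1
    return len(val_loss)

# Validation losses copied from the clf3 training log above.
history = {'val_loss': [0.9411, 0.9210, 0.9179, 0.9172, 0.9169, 0.9168, 0.9167]}
print(plateau_epoch(history['val_loss']))  # -> 4
```

With a looser tolerance the reported epoch moves earlier; the point is only that the fitted model's history_ attribute contains everything needed to pick epochs programmatically.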

Visualize loss history

In [147]:
pd.DataFrame.from_dict(clf3.history_).plot(title='Error Loss History');
In [148]:
y_train_scores = clf3.decision_scores_ 
y_train_scores
Out[148]:
array([1.8273561 , 1.97727674, 1.81639434, ..., 1.50032766, 0.95427348,
       0.98825012])
In [149]:
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
y_test_scores = clf3.decision_function(X)  # outlier scores
y_test_scores = pd.Series(y_test_scores)

Plotting the Scores

In [150]:
# Plot it!
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores, bins=150)  
plt.title("Histogram for Model Autoencoder Clf3, Anomaly Scores, with 150 bins")
plt.xlabel('Score')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 10.0 and re-plot

In [151]:
y_test_scores_small = y_test_scores[y_test_scores < 10.0]
# y_test_scores_small
In [152]:
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores_small, bins='auto')  
plt.title("Histogram for Model Autoencoder Clf3, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()

Build a 3 cluster model

In [153]:
df_test = pd.DataFrame(X)
df_test['score'] = y_test_scores

df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
                                        (np.where(df_test['score'] > 6.0, 2, 1))))

df_test['cluster'].value_counts()
Out[153]:
1    160800
2      1742
3       523
Name: cluster, dtype: int64

Percentage of Data points in each cluster

In [154]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[154]:
Percentage of total Cluster
1 98.610983 1
2 1.068286 2
3 0.320731 3

I will provide the description and business insight later below, when I aggregate with the 'Average' method

Summary statistics

In [155]:
# Now let's show the summary statistics:

df_test.groupby('cluster').mean()
Out[155]:
Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop score
cluster
1 -0.048137 0.015426 -0.003444 -0.063561 -0.032732 -0.022992 1.779880
2 3.363801 -0.835532 0.232484 3.431336 2.006357 0.233961 7.344769
3 3.596005 -1.959903 0.284640 8.113254 3.380891 6.289655 17.101343

Observation:

All three models produce the exact same number of data points in each cluster.

Note:

Outlier Detection Models Are Prone and Sensitive to Outliers/Anomalies

Outlier detection models tend to be very sensitive to outliers, so an unsupervised model such as this autoencoder can easily overfit. We need to produce stable models.

Achieve Model Stability by Aggregating Multiple Models

The solution is to train multiple models and then aggregate their scores. By aggregating multiple models, the chance of overfitting is greatly reduced and prediction accuracy improves. The PyOD module offers four methods to aggregate the outcomes; I will use the 'Average' method for my analysis.

5.1.4. Aggregate by 'Average' Method to get better results

In [156]:
new_X = X.drop(columns = ['score', 'cluster'])
In [157]:
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
                             'clf2': clf2.decision_scores_,
                             'clf3': clf3.decision_scores_
                            })

test_scores  = pd.DataFrame({'clf1': clf1.decision_function(new_X),
                             'clf2': clf2.decision_function(new_X),
                             'clf3': clf3.decision_function(new_X) 
                            })
In [158]:
# Although we did standardization before, it was for the variables.
# Now we do the standardization for the decision scores

train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)

Average Method

In [159]:
y_by_average = average(test_scores_norm)
y_by_average[1:10]
Out[159]:
array([ 0.0519935 , -0.05145131, -0.07393179,  0.0627655 ,  0.19510981,
        0.01899418,  0.43205148,  0.01813374, -0.02456071])
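
The standardize-then-average combination used above can be reproduced with plain NumPy. A sketch on a hypothetical 3 x 3 score matrix (not the real decision scores), where each column holds one detector's scores; note PyOD's standardizer fits its statistics on the train scores and applies them to the test scores, which this one-matrix sketch simplifies away:

```python
import numpy as np

# Each column: one detector's decision scores; each row: one observation.
scores = np.array([[1.8, 2.0, 1.9],
                   [0.9, 1.0, 1.1],
                   [9.5, 8.7, 9.9]])

# Standardize per detector (zero mean, unit variance), then average across
# detectors -- the same idea as standardizer(...) followed by average(...).
scores_norm = (scores - scores.mean(axis=0)) / scores.std(axis=0)
y_by_average = scores_norm.mean(axis=1)

print(y_by_average.argmax())  # row 2, the obvious outlier, gets the top score
```

Standardizing first matters because detectors with larger raw score scales would otherwise dominate the average.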

Plotting the scores

In [160]:
fig, ax = plt.subplots(figsize= (20,6))
plt.plot(y_by_average);
plt.axhline(y=clf3.threshold_, c='r', ls='dotted', label='threshold');
plt.title("Aggregated (average) anomaly scores with Clf3's automatically calculated threshold, Autoencoder");
In [161]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average, bins=200) # arguments are passed to np.histogram
plt.title("Combination by average, for model Autoencoder, with 200 bins")
plt.xlabel('Score')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 8.0 and re-plot

In [162]:
y_by_average_small = y_by_average[y_by_average < 8.0]
y_by_average_small
Out[162]:
array([-0.04461277,  0.0519935 , -0.05145131, ..., -0.22014229,
       -0.18677936,  0.53542838])
In [163]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average_small, bins = 'auto')  # arguments are passed to np.histogram
plt.title("Histogram for Combination by Average, Autoencoder, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()

Build a 3 cluster model

In [164]:
df_test = pd.DataFrame(X)
df_test['score'] = y_by_average

df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
                                        (np.where(df_test['score'] > 2.0, 2, 1))))

df_test['cluster'].value_counts()
Out[164]:
1    159312
2      3623
3       130
Name: cluster, dtype: int64

Percentage of Data points in each cluster

In [165]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[165]:
Percentage of total Cluster
1 97.698464 1
2 2.221813 2
3 0.079723 3
In [166]:
fig = px.bar(temp, x = 'Cluster', y = 'Percentage of total', color = 'Cluster',
             width=600, height=400,
            title = "Percentage of total in each cluster")
fig.show()

Final Summary statistics

In [167]:
cluster_df = df_test.groupby('cluster').mean().reset_index()
cluster_df
Out[167]:
cluster Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop score
0 1 -0.072305 0.019843 -0.005284 -0.081430 -0.047028 -0.023663 -0.097021
1 2 3.057322 -0.800623 0.211932 3.181745 1.927611 0.344101 3.364443
2 3 3.402828 -2.003812 0.569357 11.117692 3.911148 19.408110 19.016398
In [168]:
cluster_df.plot(x = 'cluster', y = 'Out_of_Pocket_Payment', color = 'green', kind = 'bar')
Out[168]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7eb1835990>
In [169]:
cluster_df.plot(x = 'cluster', y ='Total_Disc_by_Pop' , color = 'red', kind ='bar')
Out[169]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7eaf163910>
In [170]:
cluster_df.plot(x = 'cluster', y ='score' , color = 'blue', kind ='bar')
Out[170]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ead636fd0>

Observation:

From the above 3 cluster analysis, the following can be explained:

A. Clusters 2 and 3 hold a very small percentage of the total data. So, there are two possible explanations for these clusters:

  • Either they are true anomalies, or
  • They are providers of rare services, or of very expensive operations, surgeries, or services performed by only a very limited number of providers in the country.

B. Cluster 3 holds roughly 0.08%, and Cluster 2 roughly 2.22%, of the data. These are the critical cut-off points, as I believe any cluster holding far less than 10% of the entire data stands apart from the normal cluster groups.

C. The other cluster, which has over 10% of the total data in its pool, seems normal and fine. The differences there probably arise from the State or the DRG definition, and are not likely to be a cause for anomalies.

ANOMALIES or SUSPICIOUS clusters:

The above table gives a comprehensive view of the per-cluster means, which is useful to identify anomalies.

From the final 6 features used for clustering, I generated 3 different clusters using the aggregated average method. The following are the key takeaways for anomalies and suspicious clusters:

Total Discharges by Population

  1. Cluster 3 has an extremely high value, far from the near-zero means of the other clusters. Its mean of about 19 is a very big concern, even though the cluster holds only about 0.08% of the total data points.

Average Total Payments

  1. Clusters 2 and 3 have extremely high values. Cluster 3, with around 0.08% of the total data points, and cluster 2, with around 2.22%, both have means above 3, far above cluster 1's mean of roughly zero.

Out of Pocket Payment

  1. Cluster 3 has extremely high values. With only around 0.08% of the total data points, it has a mean of over 11, several times the means of the other clusters.

Median Score

  1. Cluster 3 has the highest mean of all the clusters, roughly twice that of cluster 2 and far above cluster 1.

From the above observations, it is clear that one distinct cluster has a very high mean for most, if not all, variables and emerges as a suspicious case: cluster 3.

Combining these results with the percentage results obtained in the previous step, I can conclude that Cluster 3 is suspicious and needs further investigation.

BUSINESS INSIGHTS

I performed the Autoencoder clustering for the data, and also used the 'Average' aggregate method to build a stable model. On building the average model, I found that the original models I built were stable and gave good insights.

My analysis gave me 3 clusters, and out of those 3, Cluster 3 seems to be the one with a lot of potential anomalies. It is suspicious because its means for the variables, as explained above, are far away from the means of the other clusters.

Further, on evaluating the mean anomaly score for each cluster, which highlights anomalous clusters since anomalies tend to have very high scores compared to the others, I see that cluster 3 has a mean score of about 19, far above all other clusters. So, I can safely conclude that Cluster 3 is highly suspicious.

So, I would pass the 130 specific entries of Cluster 3 on to the relevant authorities and call for further investigation of each entry, to understand if they are true anomalies. I will provide all the reasoning highlighted above about the differences in the means, and walk through the process I followed.

5.2. iForest

Scaling all float or integer variables

I will use StandardScaler to scale all the numerical variables

In [171]:
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
In [172]:
X_train = X.iloc[:130452,:]

iForest Clustering

5.2.1. Model 1

In [173]:
# Here, I will take the length of train as the max samples

clf1 = IForest(behaviour="new", max_samples=len(X_train)) 
clf1.fit(X_train)
Out[173]:
IForest(behaviour='new', bootstrap=False, contamination=0.1, max_features=1.0,
    max_samples=130452, n_estimators=100, n_jobs=1, random_state=None,
    verbose=0)

For the purpose of testing the trained model, I will use the entire dataset, which is X.

This is because even the train dataset contains outliers, and I need to identify outliers in the entire dataset.
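
PyOD's IForest wraps scikit-learn's IsolationForest, so the fit-on-train, score-everything workflow can be sketched on synthetic data (all names and numbers here are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_all = np.vstack([rng.normal(0, 1, size=(500, 6)),   # bulk of "normal" rows
                   rng.normal(8, 1, size=(5, 6))])    # a few obvious outliers

# Fit on a "train" slice only, as in the notebook ...
iforest = IsolationForest(max_samples=256, random_state=42).fit(X_all[:400])

# ... then score the entire dataset. sklearn's score_samples is lower-is-worse,
# so negate it to match PyOD's higher-is-more-anomalous convention.
scores = -iforest.score_samples(X_all)
print(scores[-5:].mean() > scores[:500].mean())  # -> True
```

Even though the fit never saw the five planted outliers, they still receive the highest scores, which is why scoring the full dataset with a detector fit on a subset is reasonable.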

In [174]:
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X to get the outlier scores.
y_test_scores = clf1.decision_function(X)  # outlier scores
y_test_scores = pd.Series(y_test_scores)
y_test_scores.head()
Out[174]:
0   -0.017765
1   -0.020314
2   -0.006392
3   -0.010510
4   -0.017119
dtype: float64

Plotting the Scores

In [175]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores, bins =150)  # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf1, Anomaly Scores, with 150 bins")
plt.xlabel('Distance')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 0.2 and re-plot

In [176]:
y_test_scores_small = y_test_scores[y_test_scores < 0.2]
#y_test_scores_small
In [177]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores_small, bins = 'auto')  # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf1, Anomaly Scores, with auto bins")
plt.xlabel('Distance')
plt.show()

Observation:

A high anomaly score suggests a more abnormal record. The histogram above shows there are outliers. I will now choose a range of cut points, check the distribution of the resulting clusters, and then flag some as suspicious.

Reasonable Boundary

Generally, looking at the graph, 0.05 seems the ideal number for a cut point or reasonable boundary. If we choose 0.05 as the cut point, we can flag those >= 0.05 as outliers.

But, I want to investigate deeper, and I will choose 2 different cut points, which are:

  • 0.05
  • 0.10

Now, let's evaluate these 2 different cut points and build a 3-cluster model.
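
Besides eyeballing the histogram, a cut point can be derived from an assumed contamination rate, which is essentially how PyOD computes its threshold_: take the corresponding quantile of the scores. A sketch on hypothetical scores (the contamination value here is an assumption for illustration):

```python
import numpy as np

scores = np.array([-0.02, 0.00, 0.01, 0.02, 0.03,
                    0.04, 0.06, 0.08, 0.12, 0.15])

contamination = 0.1  # assume ~10% of the data is anomalous
threshold = np.percentile(scores, 100 * (1 - contamination))
n_outliers = int((scores > threshold).sum())
print(round(threshold, 3), n_outliers)  # -> 0.123 1
```

A quantile-based cut point adapts to the score distribution automatically, whereas a hand-picked boundary like 0.05 stays fixed across models.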

Build a 3 cluster model as per above reasonable boundaries

In [178]:
df_test = pd.DataFrame(X)
df_test['distance'] = y_test_scores

df_test['cluster'] = (np.where(df_test['distance'] > 0.10, 3,
                                        (np.where(df_test['distance'] > 0.05, 2, 1))))

df_test['cluster'].value_counts()
Out[178]:
1    159153
2      2765
3      1147
Name: cluster, dtype: int64

Percentage of Data points in each cluster

In [179]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[179]:
Percentage of total Cluster
1 97.600957 1
2 1.695643 2
3 0.703400 3

I will provide the description and business insight later below, when I aggregate with the 'Average' method

Summary statistics

In [180]:
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
Out[180]:
Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop distance
cluster
1 -0.057836 0.025343 0.001673 -0.077345 -0.040763 -0.023621 -0.038729
2 2.096976 -0.720663 -0.028560 2.121338 1.198851 0.379876 0.069414
3 2.970079 -1.779165 -0.163358 5.618339 2.766047 2.361793 0.151507

5.2.2. Model 2

In [181]:
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
In [182]:
X_train = X.iloc[:130452,:]
In [183]:
# Here, I will take the 80% of the length of train as the max samples

clf2 = IForest(behaviour="new", max_samples=int(round((len(X_train)*0.8),0))) 
clf2.fit(X_train)
Out[183]:
IForest(behaviour='new', bootstrap=False, contamination=0.1, max_features=1.0,
    max_samples=104362, n_estimators=100, n_jobs=1, random_state=None,
    verbose=0)

For the purpose of testing the trained model, I will use the entire dataset, which is X. This is because even the train dataset contains outliers, and I need to identify outliers in the entire dataset.

In [184]:
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X to get the outlier scores.
y_test_scores = clf2.decision_function(X)  # outlier scores
y_test_scores = pd.Series(y_test_scores)

Plotting the Scores

In [185]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores, bins =150)  # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf2, Anomaly Scores, with 150 bins")
plt.xlabel('Distance')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 0.2 and re-plot

In [186]:
y_test_scores_small = y_test_scores[y_test_scores < 0.2]
#y_test_scores_small
In [187]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores_small, bins = 'auto')  # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf2, Anomaly Scores, with auto bins")
plt.xlabel('Distance')
plt.show()

Build a 3 cluster model

In [188]:
df_test = pd.DataFrame(X)
df_test['distance'] = y_test_scores

df_test['cluster'] = (np.where(df_test['distance'] > 0.10, 3,
                                        (np.where(df_test['distance'] > 0.05, 2, 1))))

df_test['cluster'].value_counts()
Out[188]:
1    159052
2      2742
3      1271
Name: cluster, dtype: int64

Percentage of Data points in each cluster

In [189]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[189]:
Percentage of total Cluster
1 97.539018 1
2 1.681538 2
3 0.779444 3

I will provide the description and business insight later below, when I aggregate with the 'Average' method

Summary statistics

In [190]:
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
Out[190]:
Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop distance
cluster
1 -0.061432 0.022489 -0.003829 -0.076136 -0.043942 -0.024892 -0.038746
2 2.140488 -0.542618 0.195285 1.962936 1.230469 0.357060 0.069204
3 3.069807 -1.643697 0.057894 5.292899 2.844308 2.344668 0.152862

5.2.3. Model 3

In [191]:
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
In [192]:
X_train = X.iloc[:130452,:]
In [193]:
# Here, I will take the 60% of the length of train as the max samples

clf3 = IForest(behaviour="new", max_samples=int(round((len(X_train)*0.6),0))) 
clf3.fit(X_train)
Out[193]:
IForest(behaviour='new', bootstrap=False, contamination=0.1, max_features=1.0,
    max_samples=78271, n_estimators=100, n_jobs=1, random_state=None,
    verbose=0)

For the purpose of testing the trained model, I will use the entire dataset, which is X. This is because even the train dataset contains outliers, and I need to identify outliers in the entire dataset.

In [194]:
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X to get the outlier scores.
y_test_scores = clf3.decision_function(X)  # outlier scores
y_test_scores = pd.Series(y_test_scores)

Plotting the Scores

In [195]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores, bins =150)  # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf3, Anomaly Scores, with 150 bins")
plt.xlabel('Distance')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 0.2 and re-plot

In [196]:
y_test_scores_small = y_test_scores[y_test_scores < 0.2]
#y_test_scores_small
In [197]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores_small, bins = 'auto')  # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf3, Anomaly Scores, with auto bins")
plt.xlabel('Distance')
plt.show()

Build a 3 cluster model

In [198]:
df_test = pd.DataFrame(X)
df_test['distance'] = y_test_scores

df_test['cluster'] = (np.where(df_test['distance'] > 0.10, 3,
                                        (np.where(df_test['distance'] > 0.05, 2, 1))))

df_test['cluster'].value_counts()
Out[198]:
1    158462
2      3137
3      1466
Name: cluster, dtype: int64

Percentage of Data points in each cluster

In [199]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']

temp
Out[199]:
Percentage of total Cluster
1 97.177199 1
2 1.923773 2
3 0.899028 3

I will provide the description and business insight later below, when I aggregate with the 'Average' method

Summary statistics

In [200]:
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
Out[200]:
Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop distance
cluster
1 -0.064011 0.030870 0.001659 -0.087325 -0.042281 -0.025390 -0.042233
2 1.989104 -0.701649 -0.035798 2.007262 1.030548 0.209914 0.069865
3 2.662647 -1.835372 -0.102721 5.143884 2.364976 2.295209 0.151936

Note:

Outlier Detection Models Are Prone and Sensitive to Outliers/Anomalies

Outlier detection models tend to be very sensitive to outliers, so an unsupervised model such as this iForest can easily overfit. We need to produce stable models.

Achieve Model Stability by Aggregating Multiple Models

The solution is to train multiple models and then aggregate their scores. By aggregating multiple models, the chance of overfitting is greatly reduced and prediction accuracy improves. The PyOD module offers four methods to aggregate the outcomes; I will use the 'Average' method for my analysis.

5.2.4. Aggregate by 'Average' Method to get better results

In [201]:
new_X = X.drop(columns = ['distance', 'cluster'])
In [202]:
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
                             'clf2': clf2.decision_scores_,
                             'clf3': clf3.decision_scores_
                            })

test_scores  = pd.DataFrame({'clf1': clf1.decision_function(new_X),
                             'clf2': clf2.decision_function(new_X),
                             'clf3': clf3.decision_function(new_X) 
                            })
In [203]:
# Although we did standardization before, it was for the variables.
# Now we do the standardization for the decision scores

train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
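
As I understand PyOD's utility, standardizer(train_scores, test_scores) fits a StandardScaler on the train scores and applies it to both sets. A minimal sketch with one hypothetical detector's scores:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # one column = one detector's train scores
test = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(train)      # statistics come from train only
train_norm = scaler.transform(train)
test_norm = scaler.transform(test)

print(test_norm.ravel())                  # test scores in train's units
```

Fitting on the train scores only keeps the test scores on a comparable footing without letting them influence the scaling.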

Average Method

In [204]:
y_by_average = average(test_scores_norm)
y_by_average[1:10]
Out[204]:
array([0.43867527, 0.83656159, 0.83254772, 0.51393274, 0.67880553,
       0.70478619, 1.34119988, 0.59001519, 0.90650996])

Plotting the scores

In [205]:
fig, ax = plt.subplots(figsize= (20,6))
plt.plot(y_by_average);
plt.axhline(y=clf3.threshold_, c='r', ls='dotted', label='threshold');
plt.title("Aggregated (average) anomaly distances with Clf3's automatically calculated threshold, iForest");
In [206]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average, bins=200) # arguments are passed to np.histogram
plt.title("Combination by average, for model iForest, with 200 bins")
plt.xlabel('Distances')
plt.show()

In the above graph, it is difficult to see the distribution because the score range is very wide. So, I will cut the range at 8.0 and re-plot

In [207]:
y_by_average_small = y_by_average[y_by_average < 8.0]
y_by_average_small
Out[207]:
array([ 0.58351696,  0.43867527,  0.83656159, ..., -0.45167965,
        0.4361085 ,  0.29024326])
In [208]:
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average_small, bins = 'auto')  # arguments are passed to np.histogram
plt.title("Combination by average, for model iForest, with auto bins")
plt.xlabel('Distances')
plt.show()

Build a 3 cluster model

In [209]:
df_test = pd.DataFrame(X)
df_test['distance'] = y_by_average

df_test['cluster'] = (np.where(df_test['distance'] > 6.0, 3,
                                        (np.where(df_test['distance'] > 3.0, 2, 1))))

df_test['cluster'].value_counts()
Out[209]:
1    159847
2      2801
3       417
Name: cluster, dtype: int64
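The nested np.where above splits the averaged scores at the 3.0 and 6.0 cut-offs. The same three-way binning can be expressed with pd.cut (same thresholds, hypothetical toy distances):

```python
import numpy as np
import pandas as pd

# Toy averaged anomaly distances (hypothetical values)
distances = pd.Series([0.5, 2.9, 3.1, 6.5])

# Same cut-offs as above: <= 3.0 -> 1, (3.0, 6.0] -> 2, > 6.0 -> 3
clusters = pd.cut(distances, bins=[-np.inf, 3.0, 6.0, np.inf], labels=[1, 2, 3])
print(list(clusters))  # [1, 1, 2, 3]
```

pd.cut keeps the bin edges explicit and scales to more clusters without deeper nesting.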

Percentage of Data points in each cluster

In [210]:
temp = pd.DataFrame()

temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = temp.index.astype(str)  # use the actual cluster labels from the value_counts index

temp
Out[210]:
   Percentage of total Cluster
1            98.026554       1
2             1.717720       2
3             0.255726       3
In [211]:
fig = px.bar(temp, x = 'Cluster', y = 'Percentage of total', color = 'Cluster',
             width=600, height=400,
            title = "Percentage of total in each cluster")
fig.show()

Final Summary statistics

In [212]:
cluster_df = df_test.groupby('cluster').mean().reset_index()
cluster_df
Out[212]:
cluster Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Median_Score Total_Disc_by_Pop distance
0 1 -0.049847 0.023043 0.000223 -0.070466 -0.037012 -0.023717 -0.083861
1 2 2.260479 -0.966112 -0.012421 2.769039 1.494202 0.710808 3.937379
2 3 3.924005 -2.343696 -0.002039 8.411870 4.150888 4.316983 7.682231
In [213]:
cluster_df.plot(x = 'cluster', y = 'Out_of_Pocket_Payment', color = 'green', kind = 'bar')
Out[213]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e9e5bc250>
In [214]:
cluster_df.plot(x = 'cluster', y ='Total_Disc_by_Pop' , color = 'red', kind ='bar')
Out[214]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7eadb8ad90>
In [215]:
cluster_df.plot(x = 'cluster', y ='distance' , color = 'blue', kind ='bar')
Out[215]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e9db5af50>

Observation:

From the above 3 cluster analysis, the following can be explained:

A. Clusters 2 and 3 hold a very small percentage of the total data. So, there are two possible explanations for these clusters:

  • Either they are true anomalies
  • Or they are providers of rare services, or of very expensive operations, surgeries, or services performed by only a very limited number of providers in the country.

B. Cluster 3 holds roughly 0.26%, and Cluster 2 roughly 1.72%, of the data. These fall well below the critical cut-off point, as I believe anything near or above 10% of the entire data represents a different but normal cluster group.

C. The other clusters, which hold over 10% of the total data, seem normal and fine. Their differences probably lie in the State or the DRG definition, but they are not likely to be a cause for anomalies.

ANOMALIES or SUSPICIOUS clusters:

The table above gives a comprehensive view of the means per cluster, which is useful to identify anomalies.

From the final 6 features used for clustering, I generated 3 different clusters using the aggregated average method. The following are the key takeaways for anomalies and suspicious clusters:

Total Discharges by Population

  1. Cluster 3 has an extremely high value, far from the means of the other clusters: its mean is almost 4-5 times theirs, which is a very big concern even though the cluster holds only about 0.26% of the total data points.

Average Total Payments

  1. Clusters 2 and 3 have extremely high values. Cluster 3, which holds around 0.26% of the total data points, has a mean of about 3.9, over 4-5 times the means of the other clusters. Cluster 2 also has roughly double the mean, but this might just reflect expensive hospitals or other life-saving treatments. Compared to cluster 1, cluster 3 emerges as anomalous.

Out of Pocket Payment

  1. Cluster 3 has extremely high values. Cluster 3, which holds around 0.26% of the total data points, has a mean of over 8, which is 6-7 times the means of the other clusters.

Median Score

  1. Cluster 3 has the highest mean among all clusters, 3-4 times higher than the others.

From the above observations, it is clear that one distinct cluster has a very high mean for most, if not all, variables and emerges as a group of suspicious cases or anomalies: cluster 3.

Combining these results with the percentage results obtained in the previous step, I can conclude that Cluster 3 is suspicious and needs further investigation.

BUSINESS INSIGHTS

I performed the Autoencoder and iForest clustering for the data, and used the 'Average' aggregate method to build a stable model. On building the averaged model, I found that the original model I built was stable and had good insights.

My analysis gave me 3 clusters, and out of those 3, Cluster 3 seems to be the one with a lot of potential anomalies; it is suspicious because its variable means, as explained above, are far away from the means of the other clusters.

Further, evaluating the distance for each cluster gives insight into which clusters contain anomalies, as anomalies tend to have very high scores compared to the others. Cluster 3 has a score almost 5-6 times higher than all other clusters, so I can safely conclude that Cluster 3 is highly suspicious.

So, I would pass the 417 specific entries of Cluster 3 to the relevant authorities and call for further investigation of each entry, to understand if they are true anomalies. I will provide all the reasoning highlighted above about the differences in the means, and walk through the process I followed.

Conclusion:

Autoencoder and iForest are very good models for identifying clusters that might be suspicious or contain anomalies. For the analysis, I used 3 clusters as the optimum number. An important observation was that:

  • Even though some clusters have a very high in-cluster mean, they may contain a good portion of the total data, so we cannot call such a cluster suspicious. So, finally, per my analysis, I can say that cluster 3 (and perhaps cluster 2) in both models was suspicious, as it contained fewer than 5% of the total data points and also had high in-cluster means for a lot of variables, which makes it very suspicious.